ABSTRACT

This handbook treats a restricted set of statistical procedures for
addressing some of the most prevalent technical issues that arise in
domain-referenced testing. The procedures discussed here were chosen
because they do not necessitate extensive computations. The five major
sections of the paper cover: (1) item analysis procedures for using data
to help identify items that may be flawed; (2) a simple procedure for
establishing a cutting score; (3) a procedure for establishing an
advancement score; (4) two types of error that can be made when a
decision about an examinee is based on the examinee's observed score
rather than his or her universe score; and (5) a number of issues
associated with assessing the quality of domain-referenced measurement
procedures for a group of examinees. (Author/BW)
When domain-referenced (or criterion-referenced) testing was originally
introduced almost 20 years ago, there were many who predicted its quick
demise. They were wrong, and dramatically so. There are many reasons
why domain-referenced testing has survived and, indeed, flourished even
in the face of some rather vocal criticisms. One reason (and in this
author's opinion, the principal reason) is that practitioners simply
refused to let domain-referenced testing die. Especially in instructional
and training contexts, many practitioners felt that traditional
norm-referenced approaches to measurement simply did not address issues
of principal interest to them. By contrast, domain-referenced testing
seemed to address such issues; and even if the answers provided were
imperfect, at least the issues addressed were judged relevant. In
looking favorably upon domain-referenced testing, such practitioners
were not disparaging norm-referenced testing per se; they were simply
arguing that it was not necessarily the best approach in all contexts.

Broadly speaking, the literature on domain-referenced testing has
evolved in two directions: literature dealing with item and test
construction, and literature dealing with technical measurement issues.
Much of the test development literature has been written for
practitioners, but a great deal of the technical measurement literature
is written at a level well beyond the background and experience of
typical practitioners. This handbook is intended to help bridge this gap.

In this handbook, I have selected a set of procedures that I think
"hang together" reasonably well in a psychometric sense, although no
completely consistent set of procedures is currently available, in my
opinion. Such judgments about inclusion and exclusion of procedures
are admittedly open to criticism, but failure to make such judgments
would render this handbook much too involved and computationally
complicated.

In an attempt to minimize computational requirements, and to simplify
both the description and use of these procedures, I have occasionally
found it necessary to modify (or extend) an existing procedure. Also,
I have developed computationally simplified versions of certain
statistics, and I have generated tables that hopefully facilitate the
application of certain procedures. Otherwise, however, the procedures
discussed are not new; rather, they are occasionally reformulated,
frequently simplified, and intentionally integrated.

I wish to express my gratitude to the Navy Personnel Research and
Development Center, and especially Dr. Pat-Anthony Federico, for supporting
the development of this handbook. Also, I sincerely appreciate the many
helpful comments I have received about an earlier draft of this handbook
from Dr. Michael T. Kane, Dr. D. R. Divgi, Dr. Ross Traub, and several
of my colleagues at ACT. Finally, I am very grateful to Ms. Wanda Hawkins
for the excellent job she has done in setting up tables and equations,
and typing this manuscript.

Robert L. Brennan
Table 1.1
Formulas for Calculating Sample Means, Variances, and Standard Deviations

Notation

  $\bar{x}_p$ = observed mean score for person p
                (proportion of items correct)
  $k$ = number of persons
  $\Sigma$ = a symbol meaning "sum the scores"

Example data

  Suppose k = 6 persons have the following observed
  mean scores: .6, .8, .8, .8, .9, .9

Calculate

  $\Sigma \bar{x}_p$ = sum of mean scores for k persons
      Example: .6 + .8 + .8 + .8 + .9 + .9 = 4.80

  $\Sigma \bar{x}_p^2$ = sum of squared mean scores for k persons
      Example: .36 + .64 + .64 + .64 + .81 + .81 = 3.90

Sample Mean

  (1.1)  $\bar{x} = \Sigma \bar{x}_p / k$
      Example: $\bar{x} = 4.80/6 = .80$

Sample Variance

  (1.2)  $s^2(\bar{x}_p) = \dfrac{\Sigma \bar{x}_p^2}{k} - \bar{x}^2$
      Example: $s^2(\bar{x}_p) = \dfrac{3.90}{6} - (.80)^2 = .010$

Sample Standard Deviation

  (1.3)  $s(\bar{x}_p) = \sqrt{s^2(\bar{x}_p)}$
      Example: $s(\bar{x}_p) = \sqrt{.010} = .100$

Corrected Sample Variance

  (1.4)  $\hat{s}^2(\bar{x}_p) = \dfrac{k}{k-1}\, s^2(\bar{x}_p)$
      Example: $\hat{s}^2(\bar{x}_p) = \dfrac{6}{6-1}(.010) = .012$

Corrected Sample Standard Deviation

  (1.5)  $\hat{s}(\bar{x}_p) = \sqrt{\hat{s}^2(\bar{x}_p)}$
      Example: $\hat{s}(\bar{x}_p) = \sqrt{.012} = .110$
However, as far as this handbook is concerned, the sole reason for
choosing between $s^2$ and $\hat{s}^2$ is to provide the simplest
possible computational procedures for estimating quantities of interest.
(A similar statement holds for the corresponding standard deviations,
$s$ and $\hat{s}$.)
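
The formulas in Table 1.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_statistics` and the dictionary keys are hypothetical, and the guard against a slightly negative variance is an added safeguard against rounding, not part of the handbook's formulas.

```python
import math


def sample_statistics(mean_scores):
    """Return the Table 1.1 quantities for a list of observed mean scores."""
    k = len(mean_scores)                          # number of persons
    total = sum(mean_scores)                      # sum of mean scores
    total_sq = sum(x * x for x in mean_scores)    # sum of squared mean scores

    mean = total / k                              # Equation 1.1
    # Equation 1.2; max() guards against a tiny negative value from rounding
    variance = max(total_sq / k - mean ** 2, 0.0)
    sd = math.sqrt(variance)                      # Equation 1.3
    corrected_variance = (k / (k - 1)) * variance  # Equation 1.4
    corrected_sd = math.sqrt(corrected_variance)   # Equation 1.5

    return {
        "mean": mean,
        "variance": variance,
        "sd": sd,
        "corrected_variance": corrected_variance,
        "corrected_sd": corrected_sd,
    }


# The six observed mean scores used in the Table 1.1 example.
stats = sample_statistics([0.6, 0.8, 0.8, 0.8, 0.9, 0.9])
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, the results agree with the worked example in Table 1.1.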

It was mentioned, above, that a standard deviation is a measure of
the amount of spread or dispersion in a set of scores. To give the
concept of a standard deviation a more concrete interpretation, it is
common practice to consider the standard deviation of a particular
bell-shaped distribution of scores, called a normal distribution. As
illustrated in Figure 1.1, for a normal distribution: (a) 68% of the
scores lie within one standard deviation to the right and left of the
mean; and (b) 95% of the scores lie within two standard deviations to
the right and left of the mean. These two statements also can be
expressed in terms of what are called "z-scores."

As indicated in Figure 1.1, a score that lies one standard deviation
above the mean can be denoted z = 1; and a score that lies one standard
deviation below the mean can be denoted z = -1. It follows that, for
a normal distribution, 68% of the scores lie between z = -1 and z = 1.
Similarly, 95% of the scores lie between z = -2 and z = 2.

The above statements about percent of cases between specified z-scores
do not apply to all possible distributions of scores. However, provided
one does not interpret such statements too literally, they can properly
serve as useful benchmarks for conceptualizing the interpretation of a
standard deviation.
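
As a rough check on these benchmarks, the sketch below converts a score to a z-score and computes the fraction of a normal distribution lying within a given number of standard deviations of the mean. The function names are hypothetical, and the calculation assumes an exactly normal distribution, which real test scores need not follow.

```python
import math


def z_score(score, mean, sd):
    """Number of standard deviations that a score lies above the mean."""
    return (score - mean) / sd


def normal_fraction_within(z):
    """Fraction of a normal distribution between -z and +z."""
    return math.erf(z / math.sqrt(2))


print(f"within 1 SD: {normal_fraction_within(1):.1%}")
print(f"within 2 SD: {normal_fraction_within(2):.1%}")
```

The two printed values round to the 68% and 95% figures quoted above.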



The reader is cautioned not to infer from the above paragraphs that
test scores are usually (or should be) normally distributed. Indeed,
for domain-referenced tests, it is quite common to have many
high-scoring examinees and relatively few low-scoring examinees; and
such a distribution is not normal. For this reason, most procedures
treated in this handbook involve no assumption about the shape of the
score distribution.



Universe of Items 



A universe of items is a concept of central importance for
domain-referenced interpretations, because ultimately one wants to make
inferences about examinee universe, or domain, scores. (Considerations
with respect to a universe of items are prominent in some approaches to
norm-referenced interpretations, too, but norm-referenced
interpretations are not within the scope of this handbook.)

Sometimes there actually exists a set of items that can be considered
as the intended universe. For example, some computer-managed instruction
systems have a large bank of items that is used to construct specific
tests. Also, the words in a specified dictionary might constitute a
universe for a spelling domain.

More frequently, however, pragmatic concerns require that one
conceptualize a universe of items for the content under consideration.
For example, in the initial stages of developing a domain-referenced
testing system, it is likely that only a limited number of items will be
available. Furthermore, for many content areas, it would be virtually
impossible to construct all relevant items, or even a large proportion
of such items. In such cases, it is especially important that the
intended universe be defined and described in as clear and unambiguous a
manner as possible. Otherwise, one cannot easily claim that a particular
item does, or does not, reference the intended domain; nor can one
clearly specify what an examinee's universe score means.

No matter how a universe may be defined, in this handbook a test
is viewed as a sample of items from an intended universe. More
specifically, to be technically correct, we ought to say that a test is
a random sample of items from the universe, in the sense that every item
in the universe has an equal chance of appearing in any test. In
practice, one seldom has the opportunity to randomly select a sample of
items, in the literal sense of the word "randomly." However, if a
universe is defined well enough, then one can usually ensure that a test
consists of a reasonably representative sample of items from the
intended universe.
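
Where an actual item bank exists, the idea of a test as a random sample of items can be sketched as follows. The item identifiers, the bank size of 200, the test length of 10, and the function name `draw_test` are all hypothetical; the sketch only illustrates equal-chance selection without replacement.

```python
import random


def draw_test(universe, n_items, seed=None):
    """Draw n_items from the universe so every item has an equal chance."""
    rng = random.Random(seed)           # a seed makes the draw repeatable
    return rng.sample(universe, n_items)


# Hypothetical bank of 200 item identifiers.
item_bank = [f"item_{i:03d}" for i in range(1, 201)]
test_form = draw_test(item_bank, 10, seed=1)
print(test_form)
```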


It can be argued that for every objective in a program or
instructional sequence, there ought to be a distinct universe of items.
It is not uncommon, however, for a test to reference a universe that
might be viewed as stratified, in the sense that the universe is defined
by multiple objectives or the multiple categories in a table of
specifications or task-content matrix. The procedures discussed in this
handbook do not specifically incorporate considerations with respect to
a universe defined in this manner, even though these procedures (or
similar ones) are sometimes used with such universes.
Overview

No matter how well-defined a universe of items may be, the quality
of the decisions made can be no higher than the quality of the items
themselves. Therefore, Section 2 considers some simple item analysis
procedures for using data to help identify items that may be flawed. This
topic is rather mundane, and the process of performing item analyses is
tedious; but, in this author's opinion, the validity of a
domain-referenced measurement procedure absolutely necessitates using
good items that represent a well-defined universe of items. Furthermore,
no after-the-fact statistical analysis of examinee test scores can
overcome the negative impact of poor items on the quality of
domain-referenced interpretations.

Section 3 considers a rather simple procedure for establishing a
cutting score, $\pi_o$, expressed as a proportion of items correct for
the universe of items. (In this handbook the Greek letter $\pi$ is used
to represent a score for the universe of items, whereas $x$ is used for
a score on a test, or sample of items from the universe.) This procedure
is "content based" in the sense that it relies upon the subjective (but,
hopefully, well-informed) judgments of content-matter specialists.

Section 4 treats a procedure for establishing an advancement score.
Recall that a cutting score, $\pi_o$, is expressed as a proportion of
items correct for the universe of items; and, as such, $\pi_o$ is
"similar" to an examinee's universe score, $\pi$, in the sense that both
$\pi_o$ and $\pi$ reference the same universe of items. By contrast, an
advancement score, $x_o$, is "similar" to an examinee's observed score,
$x$, in the sense that both reference a test score. To put it another
way, an advancement score is an observed score analogue of a cutting
score, just as an examinee's test score is an observed score analogue of
his/her universe score. A decision concerning mastery is actually made
with respect to the advancement score; i.e., an examinee is declared a
master if his/her observed score is at or above the advancement score.

Section 5 considers two types of error that can be made when a
decision about an examinee is based on the examinee's observed score
rather than his/her universe score (which is never known). These two
types of error are called error of measurement and error of
classification. Error of measurement involves the extent to which
examinee observed and universe scores differ; and, as such, error of
measurement does not involve consideration of a cutting score. By
contrast, an error of classification is made if an examinee is
erroneously classified as a master or erroneously classified as a
non-master.
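
The distinction between the two types of error can be illustrated with a small sketch. It is purely conceptual, because a universe score is never known in practice. The function names, the sign convention (observed minus universe), the treatment of a "true" master as one whose universe score is at or above the cutting score, and the numerical values are all assumptions made for illustration.

```python
def declare_master(observed_score, advancement_score):
    """Decision rule: master if observed score is at or above x_o."""
    return observed_score >= advancement_score


def describe_decision(observed, universe, advancement_score, cutting_score):
    """Return (error of measurement, classification outcome)."""
    # Error of measurement involves no cutting score at all.
    error_of_measurement = observed - universe

    declared = declare_master(observed, advancement_score)
    truly_master = universe >= cutting_score   # assumed definition

    if declared == truly_master:
        outcome = "correct classification"
    elif declared:
        outcome = "erroneously classified as a master"
    else:
        outcome = "erroneously classified as a non-master"
    return error_of_measurement, outcome


# Hypothetical examinee: observed .80, universe score .70.
error, outcome = describe_decision(0.80, 0.70,
                                   advancement_score=0.80,
                                   cutting_score=0.75)
print(f"error of measurement = {error:.2f}; {outcome}")
```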


Section 6 considers a number of issues associated with assessing
the quality of domain-referenced measurement procedures for a group of
examinees. These issues are, in part, related to traditional notions
of reliability (or measurement consistency). Also, to an extent, these
issues have a validity connotation, because in domain-referenced
testing, examinee universe scores are a principal "criterion" of
interest. However, the terms "reliability" and "validity" are used only
infrequently in Section 6 because they too easily connote traditional
statistical analyses (for norm-referenced interpretations) that are
inappropriate in domain-referenced measurement contexts. Rather,
emphasis is placed upon certain agreement coefficients and group-based
measures of error.

Restrictions in Scope and Content

Domain-referenced measurement is currently a topic of considerable
interest in numerous applied settings, and a handbook such as this
cannot treat all relevant issues in all such settings. In particular,
there are many important educational, philosophical, legal, ethical,
and technical issues involved in testing for licensure, certification,
"minimal" competency, etc. For the most part, such issues are not treated
here; rather, emphasis is placed upon procedures that seem to this author
to be both theoretically reasonable and capable of being used relatively
easily by practitioners, especially practitioners in instructional and
training environments where nothing more sophisticated than a simple,
hand-held calculator may be available.

Throughout this handbook it is assumed that examinee responses are
not corrected for guessing. In several cases, the procedures discussed
could be (or have been) modified in various ways to take guessing into
account. Such modifications are not treated here for three reasons.
First, many such modifications make assumptions about guessing that the
author believes are unrealistic. Second, reasonable assumptions about
guessing involve complexities considerably beyond the scope of this
handbook. Third, it remains to be seen (in a research sense) whether
or not procedures involving reasonable assumptions about guessing
materially improve the quality of decisions made in typical
domain-referenced testing situations.




In the field of statistics, distinctions are carefully drawn between
quantities of principal interest, called parameters, and estimates of
these quantities, called statistics. For theoretical work, this
distinction is crucial, but to incorporate this distinction in the body
of this handbook would necessitate a much more complicated notational
system, as well as considerably more complex verbal statements.
Therefore, the term "statistic" is used in this handbook in a generic
sense (even though occasionally the word "parameter" would be better,
technically), and there is no notational distinction drawn between
parameters and estimates. Also, both quantities of principal interest
and their estimates are usually denoted with Greek letters to
distinguish them from the sample statistics discussed in conjunction
with Table 1.1. Finally, concerning notational conventions, sometimes a
symbol is underlined in the text for emphasis and/or to preclude
mistaking it for part of a word or phrase.

The body of this handbook does not contain references to published
work, proofs of formulas and equations, or justifications for choosing the
procedures treated here rather than others which might have been chosen.
However, to a limited extent, these issues are treated in Appendix B,
which is provided principally for the technically oriented reader. It
will be evident to such a reader that, in several cases, the treatments
of procedures in the body of the handbook are slight modifications of
procedures discussed in published literature. Such modifications were
made principally for computational convenience. Furthermore, in a few
instances procedures are presented, or suggestions are made, that have
not been considered previously in published literature.
2. Item Analysis Considerations

In domain-referenced testing (or any type of testing, for that
matter) there is no substitute for good items. No statistical procedure
can overcome the negative effect of poor test items; but as discussed in
this section, statistics can be used to help identify poor items.

First, however, it must be emphasized that, prior to collecting any
data, every effort must be made to ensure that items reflect the
objectives they are intended to measure and that the items have no obvious
technical flaws. Such judgments are best made by content matter
specialists who have knowledge of item construction procedures and
guidelines. If content-matter specialists do not have such knowledge,
then they should be aided in their judgments by someone who does. Also,
items should be reviewed for potential bias by members of minority groups,
especially when domain-referenced tests are to be used with members
of minority groups.

Item Analysis Table and Statistics 

No matter how thoroughly content matter experts scrutinize items 
to eliminate flaws, it is always advisable to study examinee responses 
to items. Such data provide an additional check on item quality. Usually 
such data are displayed in the form of an item analysis table such as 
that provided in Table 2.1. 

To give a context to the synthetic data in Table 2.1, let us assume
that 10 items were administered to 50 examinees, and one of these items
Table 2.1

Illustration of an Item Analysis Table and Statistics
Using Synthetic Data

                          Subgroup (a)
               ----------------------------
               Low      Medium    High
Alternative    (0-6)    (7-8)     (9-10)    Total      p       B
-----------------------------------------------------------------
a                3        1         2          6      .14    -.13
b*               8        9        16         33      .75     .18
c                2        1         1          4      .09    -.10
d                0        0         0          0      .00     .00
Omit             0        0         1          1      .02     .05
Not Reached      3        3         0          6
-----------------------------------------------------------------
Total           16       14        20         50
Total minus
Not Reached     13       11        20         44

(2.1)  $p = \text{proportion of examinees who chose the alternative (or omitted the item)}$

(2.2)  $B = \left(\text{proportion of examinees in high group who chose the alternative (or omitted the item)}\right) - \left(\text{proportion of examinees in low group who chose the alternative (or omitted the item)}\right)$

e.g., for the correct alternative, b,

       p = 33/44 = .75
       B = (16/20) - (8/13) = .80 - .62 = .18

(a) Numbers within parentheses indicate the scores (in terms of number
of items correct) that fall into each group.

Note. * indicates the correct (keyed) alternative.
resulted in the data in Table 2.1. Table 2.1 indicates that this item
contains four alternatives with the correct (or keyed) alternative being
b (the alternative that is starred). Note that the other alternatives
(namely a, c, and d) are sometimes called distractors, or incorrect
alternatives.

To study examinee performance on an item, it is usual to classify
the examinees into groups based on their test performance. In Table
2.1 this has been accomplished by assigning each examinee to: (a) a
"low" group if he/she has 0-6 items correct; (b) a "medium" group if
he/she has 7-8 items correct; or (c) a "high" group if he/she has
9-10 items correct. For present purposes, the reader can assume that
examinees in the high group would be judged "successful," those in the
low group would be judged "unsuccessful," and those in the middle group
might (or might not) be judged "successful."
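
The grouping used in Table 2.1 amounts to a simple lookup on number of items correct. In the sketch below the function name and the parameter names are hypothetical; the default score ranges are the ones stated above for a 10-item test and would need to be chosen afresh for any other test.

```python
def assign_group(number_correct, low_max=6, medium_max=8):
    """Classify an examinee as low, medium, or high by number correct."""
    if number_correct <= low_max:
        return "low"        # 0-6 items correct
    if number_correct <= medium_max:
        return "medium"     # 7-8 items correct
    return "high"           # 9-10 items correct


groups = [assign_group(score) for score in (5, 7, 8, 9, 10)]
print(groups)
```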

The entries under the columns headed low, medium, and high are the
numbers of examinees in each group who chose each alternative, omitted
the item, or did not reach the item. The following procedure can be used
to distinguish between an item that was omitted (but attempted) by an
examinee and one that was not reached (and unattempted): (a) if an
examinee omitted the last item, assume that the examinee did not reach
one item; (b) if the examinee omitted both of the last two items, assume
that two items were not reached by the examinee; (c) if the examinee
omitted all three of the last three items, assume that three items were
not reached; etc. All other blank responses by an examinee can be treated
as "omits."
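
The rule for separating "not reached" from "omits" can be sketched as follows. The representation of a response record as a list in item order with `None` for a blank, and the function name `classify_blanks`, are assumptions made for illustration.

```python
def classify_blanks(responses):
    """Label each item as answered, omit, or not reached.

    responses: list in item order; None marks a blank response.
    """
    # Start by treating every blank as an omit.
    statuses = ["omit" if r is None else "answered" for r in responses]

    # Blanks in an unbroken run at the end of the test were not reached.
    i = len(responses) - 1
    while i >= 0 and responses[i] is None:
        statuses[i] = "not reached"
        i -= 1
    return statuses


# Hypothetical record: item 2 is blank, and the last two items are blank.
record = ["b", None, "a", "c", None, None]
print(classify_blanks(record))
```

In this record the second item counts as an omit, while the last two items count as not reached.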

Table 2.1 also includes column totals indicating the total number of
examinees in each group, and the number of examinees in each group who
reached the item. The row totals in Table 2.1 indicate the total number of
examinees who picked each alternative, omitted the item, or did not reach
the item. Finally, for each alternative, Table 2.1 provides two statistics
which are identified as p and B and defined in Equations 2.1 and 2.2,
respectively. The statistic p will always have a value between 0 and 1,
and B will always be between -1 and +1.
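
Equations 2.1 and 2.2 can be sketched with the counts from Table 2.1. The data layout and the function name `item_statistics` are hypothetical. Following the worked example in the table (33/44 and 8/13), the proportions are taken over examinees who reached the item.

```python
# Counts of examinees choosing each alternative, by subgroup (Table 2.1).
counts = {
    "a":    {"low": 3, "medium": 1, "high": 2},
    "b":    {"low": 8, "medium": 9, "high": 16},   # keyed alternative
    "c":    {"low": 2, "medium": 1, "high": 1},
    "d":    {"low": 0, "medium": 0, "high": 0},
    "omit": {"low": 0, "medium": 0, "high": 1},
}
# Examinees in each subgroup who reached the item (total minus not reached).
reached = {"low": 13, "medium": 11, "high": 20}


def item_statistics(counts, reached):
    """Return {alternative: (p, B)} per Equations 2.1 and 2.2."""
    total_reached = sum(reached.values())
    stats = {}
    for alternative, by_group in counts.items():
        p = sum(by_group.values()) / total_reached            # Equation 2.1
        b = (by_group["high"] / reached["high"]
             - by_group["low"] / reached["low"])              # Equation 2.2
        stats[alternative] = (p, b)
    return stats


item_stats = item_statistics(counts, reached)
for alternative, (p, b) in item_stats.items():
    print(f"{alternative:>4}: p = {p:.2f}, B = {b:+.2f}")
```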

The statistic p indicates the proportion of examinees who chose an
alternative. For the correct alternative p is called the item difficulty
level, and it is the proportion of examinees who got the item correct. In
Table 2.1, p = .75 for the correct alternative. Note that easy items have
high difficulty levels and hard items have low difficulty levels.

The statistic B indicates the difference between the proportions of
examinees in the high and low groups who chose an alternative. For the
correct alternative, B is called an item discrimination index. It reflects
the difference between the proportion of examinees in the high group who
got the item correct and the proportion in the low group who got the item
correct.

Using Item Analysis Data

The principal use of item analysis data in domain-referenced testing
situations is to detect flawed items. It must be understood, however, that
such data, no matter how carefully analyzed, do not provide an absolute
indication that an item is or is not flawed. Also, if an item is flawed,
the data cannot tell the investigator exactly how to correct the flaw.
What the data can do is flag a potentially flawed item and usually
suggest the nature of the problem and/or the part of the item that is
flawed. Given this perspective, the following paragraphs provide some
guidelines for examining item analysis data.

(a) Have an actual copy of the item available when examining an 
item analysis table like that in Table 2.1. 

(b) Look at p for the correct alternative. The item may be flawed
if the item difficulty level, p, is considerably out of line with a value
one might expect. (Usually, in domain-referenced testing, items have
relatively high difficulty levels if they are obtained for a group of
examinees who have experienced instruction in the content tested.)

(c) Look at the relationship between item difficulty level and the
p values for the distractors. If a distractor has a value for p that
is above the item difficulty level, then examine the distractor to see if
in fact it could be considered, reasonably, as a correct answer. If so,
one of three problems probably exists: the correct answer was
mis-specified, the item has two or more correct answers, or the item is
ambiguous. In any case, the item requires revision.

(d) If p is very small for any distractor (e.g., alternative d in
Table 2.1) consider eliminating it or replacing it with some other
incorrect alternative, provided doing so does not change the intended
nature of the item. (Recall that if an item is inherently easy, it is very
likely that one or more distractors will be chosen infrequently.)
(e) Look at the item discrimination index (the value of B for the
correct alternative). It is very unlikely that a good item would have
a value for B that is noticeably negative, because that would mean that
a greater proportion of the low-scoring group got the item correct than
the high-scoring group. Therefore, if B is noticeably negative (say,
less than -.20) examine the item carefully, checking especially to see
that the item was scored correctly, that it is unambiguous, and that
the indicated correct answer is indeed correct.

(f) Look at the values of B for the distractors. If any of them
are noticeably positive (say, above .20), check the item to see if it
is ambiguous, or if the distractor could possibly be a correct answer.

(g) If either p or B for "omits" is noticeably positive, examine the
item for ambiguities. It is assumed, here, that examinees are not being
penalized for guessing and, therefore, there is no extrinsic motivation
for an examinee not to pick an alternative.

(h) Consider the number of examinees (especially high-scoring
examinees) who did not reach the item. If many examinees did not reach
it (e.g., see Table 2.1), the item may be all right, but it is likely that
examinees were not allowed enough time when they were tested. Unless
a domain-referenced test is intended to be speeded, examinees should
have a reasonable amount of testing time. Otherwise, the examinees'
scores will not adequately reflect their ability.

The above suggestions should be regarded as reasonable
"rules-of-thumb," not dogmatic directives. No such rules, and no amount
of item analysis data, absolve item developers and investigators from
employing common sense and good judgment based on experience and
content-matter knowledge.
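
Some of the guidelines above lend themselves to a simple screening routine, sketched below. The function name, the data layout, and the `rare_threshold` value of .02 are hypothetical; only the -.20 and .20 values come from guidelines (e) and (f). The routine covers guidelines (c) through (f) and leaves the rest, including omits and not-reached counts, to the investigator's judgment. Its output is a prompt for review, not a verdict on the item.

```python
def flag_item(stats, key, b_threshold=0.20, rare_threshold=0.02):
    """Return a list of (alternative, reason) pairs worth a second look.

    stats: {alternative: (p, B)}; key: the keyed (correct) alternative.
    """
    flags = []
    key_p, key_b = stats[key]

    # Guideline (e): noticeably negative discrimination for the key.
    if key_b < -b_threshold:
        flags.append((key, "key has noticeably negative B"))

    for alternative, (p, b) in stats.items():
        if alternative in (key, "omit"):
            continue
        # Guideline (c): distractor chosen more often than the key.
        if p > key_p:
            flags.append((alternative, "p exceeds item difficulty level"))
        # Guideline (d): distractor chosen by almost no one.
        if p < rare_threshold:
            flags.append((alternative, "distractor rarely chosen"))
        # Guideline (f): noticeably positive B for a distractor.
        if b > b_threshold:
            flags.append((alternative, "distractor has noticeably positive B"))
    return flags


# Rounded p and B values from Table 2.1; b is the keyed alternative.
table_2_1 = {
    "a": (0.14, -0.13),
    "b": (0.75, 0.18),
    "c": (0.09, -0.10),
    "d": (0.00, 0.00),
    "omit": (0.02, 0.05),
}
print(flag_item(table_2_1, key="b"))
```

For the item in Table 2.1, only alternative d is flagged, which matches the example given in guideline (d).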



Other Considerations 




In norm-referenced testing contexts it is not uncommon for items
to be discarded or revised if the value of a discrimination index is
positive but small. This criterion should not be used in
domain-referenced testing contexts. Indeed, frequently in such contexts
many good items are virtually guaranteed to have positive but small values
for a discrimination index. Also, in norm-referenced testing contexts
a high discrimination index is frequently viewed almost as an indicator
of an ideal item. This perspective should not be taken in
domain-referenced testing contexts, at least not in the sense that highly
discriminating items are preferred over moderately discriminating ones.
In domain-referenced testing situations, emphasis is placed upon content,
and discrimination indices should be used solely as an aid in identifying
flawed items, not as a basis for classifying items into degrees of quality.

In an ideal world, all items in the universe would undergo item
analysis before any decisions were made about examinees based on any
items in the universe. This ideal is seldom feasible in practice.
Even so, no item should be used as a basis for making decisions
about examinees until it has been subjected to an item analysis. To
address this issue the following procedure can be used. First, in the
initial stages of developing a universe of items, prior to using the
items for decision-making, a reasonably large sample of them should
undergo item analysis using a representative group of examinees. Items
that do not successfully clear this hurdle should be discarded or revised.
Second, to gather item analysis data on other available items, or items
subsequently developed, one can include a small number of them in
operational versions of domain-referenced tests. However, examinee scores
on any such additional item should not be used as part of the examinee
total scores for decision-making, at least not until the item analysis
data have been studied to verify that the item has no obvious flaws.

If the above approach is taken of including new items with old items
in a domain-referenced test, then it is important that the investigator
not confuse the total number of "scored items" (those not undergoing item
analysis) and the total number of items physically in the test.
Elsewhere in this handbook, when test length, n, is discussed it is always
assumed that n is the total number of items excluding those (if any)
undergoing item analysis.
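
The distinction between scored items and items undergoing item analysis can be sketched as below. The item identifiers, the 0/1 scoring, and the function name `score_operational_items` are hypothetical; the point is only that n and the total score exclude items still undergoing item analysis.

```python
def score_operational_items(item_scores, tryout_items):
    """Return (number correct, n, proportion correct) over scored items only.

    item_scores: {item id: 1 if correct, 0 otherwise}
    tryout_items: ids of items included only to gather item analysis data
    """
    scored = {item: score for item, score in item_scores.items()
              if item not in tryout_items}
    n = len(scored)                         # test length excludes tryout items
    number_correct = sum(scored.values())
    return number_correct, n, number_correct / n


# Hypothetical examinee record: four scored items and two tryout items.
record = {"i1": 1, "i2": 1, "i3": 0, "i4": 1, "t1": 0, "t2": 1}
number_correct, n, proportion = score_operational_items(record, {"t1", "t2"})
print(number_correct, n, proportion)
```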

As discussed above, conducting an item analysis usually involves
classifying examinees into groups based on total test score. If new
items are included with old items, then total test score should be based
on the old items only. Of course, in the initial stages of constructing
a universe, or pool of items, total test score will have to be based
on new items only. In either case, the investigator must choose a range
of scores associated with each group. Seldom can this decision be made
in a completely unambiguous manner, because a firm basis for this
decision would necessitate information that is seldom available at the
time the decision needs to be made. For example, in initial stages of
universe construction, a cutting score may not have been firmly
established. Furthermore, as will be discussed later, even under the best
of circumstances, it is impossible to assign examinees to groups in a
manner that is guaranteed to be completely devoid of error. Even so, for
item analysis purposes a firm basis for assigning examinees to groups
is not absolutely necessary; good informed judgment based on experience
is generally sufficient.

The above discussion of item analysis procedures has been couched
in terms of multiple-choice items. For free-response items the procedure
and guidelines are essentially the same. The principal differences are
that: (a) a free-response item can be viewed as an item with two
alternatives, correct and incorrect; and (b) the investigator needs to
study all examinee responses to make sure that all correct responses have
been identified.
3. Establishing a Cutting Score

One of the initial tasks typically encountered by an investigator
in a domain-referenced testing environment is to establish a cutting
score, $\pi_o$, expressed as a proportion of items correct for the
universe of items. Of course, this is not required if mastery type
decisions are not going to be made and interest is restricted to
estimating an examinee's universe score. However, in most
domain-referenced testing situations, mastery type decisions are made
and, consequently, a cutting score is required.

On rare occasions there is a known relationship between examinee
performance on the universe of items (or a large part of the universe)
and some external criterion such as on-the-job performance or performance
in some subsequent level of instruction. Such data are indeed
rare, however, because they are usually very difficult to obtain. For
example, if some measure of on-the-job performance is viewed as a criterion,
then one would have to take the following steps to obtain the
data required to use such performance as a basis for establishing a
cutting score: (a) test a representative group of examinees using a
large number of items from the universe; (b) allow all these examinees,
including those with low scores, to undertake the job under consideration;
and (c) evaluate the performance of each of these examinees on the
job. Three problems are usually encountered in attempting to carry out
these steps. First, these steps are usually time-consuming and expensive.
Second, it is frequently considered undesirable (and sometimes
ethically unacceptable) to allow low-scoring examinees to undertake
the job in question. And third, usually the evaluation of
on-the-job performance is both difficult and subject to considerable
error.

For these reasons, among others, external criteria are seldom used
(at least directly) in the process of establishing a cutting score for
domain-referenced testing purposes. Rather, it is common for a cutting
score to be defined based upon the judgments of raters, judges, or experts
who are content matter specialists. Of course, such judgments are likely
(indeed hopefully) to be influenced by raters' knowledge about potential
external criteria and about how persons generally perform on such criteria.
However, such information is not usually quantified directly.
Rather, several procedures exist for eliciting from raters their beliefs
about how minimally competent persons would perform on the universe of
items, the argument being that such judgments provide a basis for establishing
a cutting score π_o that separates mastery (or probably acceptable
performance) from non-mastery (or probably unacceptable performance).

Procedure

In one procedure for establishing a cutting score, each of a set of
raters, judges, or content matter specialists is asked to provide an
independent assessment of the probability that a minimally competent examinee
would get each item correct. The average probability over raters and
items (called y below) is frequently used as the cutting score π_o,
and various statistics can be calculated to assess how variable this
average probability would be if the study were replicated a large number
of times. Knowledge about such variability is important in revealing
the extent to which raters agree in their judgments about what
cutting scores should actually be established.



Using this procedure, data are collected in the following
manner:

(a) A group of t raters, and a sample of m items from the universe,
are identified, where t and m are as large as time and other constraints
will allow;

(b) Each rater is asked to provide, for each item, a probability
reflecting that rater's belief about the likelihood that a minimally
competent examinee would get that item correct;

(c) Items are presented to each rater in a random order, the
important point being that the items are ordered differently for each
rater;

(d) Each rater works independently of every other rater (i.e.,
raters do not discuss their judgments with each other); and

(e) Raters are told to report their probabilities in units of
1/10 (i.e., the probabilities that might be assigned are 0.0, 0.1,
0.2, ..., 1.0).
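
Steps (c) and (e) above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: items are identified by integer labels, and the helper names presentation_orders and to_tenths are hypothetical, not part of the handbook's procedure.

```python
import math
import random


def presentation_orders(item_ids, n_raters, seed=None):
    """Return one random item ordering per rater, with all orderings distinct."""
    if n_raters > math.factorial(len(item_ids)):
        raise ValueError("not enough distinct orderings for this many raters")
    rng = random.Random(seed)  # the seed is only for a reproducible illustration
    orders = []
    while len(orders) < n_raters:
        order = rng.sample(item_ids, len(item_ids))  # a random permutation
        if order not in orders:  # redraw if this ordering was already assigned
            orders.append(order)
    return orders


def to_tenths(probability):
    """Round a reported probability to the nearest 1/10 (0.0, 0.1, ..., 1.0)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    return round(probability * 10) / 10


# Hypothetical study size: 20 items labeled 1..20 and 5 raters
orders = presentation_orders(list(range(1, 21)), n_raters=5, seed=1)
```

Redrawing duplicate orderings is a design choice made here so that the items really are ordered differently for each rater; with 20 items a duplicate is extremely unlikely in any case.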
Table 3.1 provides synthetic data for a study of this kind with t = 5
raters and m = 20 items. These numbers are relatively small,
solely for the purpose of simplifying subsequent illustration of computations.
An entry in the body of Table 3.1 is denoted y_ri, the probability
assigned by a rater r to an item i. (The symbol y is used here
to distinguish these probabilities from examinee scores on a test, which
are later denoted with the symbol x.) Along with the probabilities,
Table 3.1 reports means, variances, and standard deviations. For example,

(a) an entry in the row labeled y_r is the mean probability assigned
to items by rater r, and s(y_r) = .083 is the standard deviation (across
raters) of these rater mean probabilities;

(b) an entry in the column headed y_i is the mean probability assigned
to item i, and s(y_i) = .086 is the standard deviation (across items) of
these item mean probabilities;

(c) an entry in the row labeled s(y_ri) is the standard deviation
of the probabilities assigned to items by rater r; and

(d) y = .80 is the mean probability over all 20 items and all
5 raters.

In a cutting score study, interest is usually focused principally
on y_r and y. We may call y_r the "cutting score assigned by rater r"
because it reflects that rater's belief about the proportion of items
that a minimally competent examinee would get correct. Similarly,
we may call y the "study cutting score," and as such it is, in a certain
statistical sense, the best value to choose for π_o.
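
The quantities just defined can be computed directly from a raters-by-items table of probabilities. The Python sketch below uses a small hypothetical table (3 raters, 4 items); these are not the Table 3.1 data, and the function name cutting_score_summary is an assumption made for illustration.

```python
from statistics import mean


def cutting_score_summary(ratings):
    """ratings[r][i] is the probability y_ri assigned by rater r to item i.

    Returns the rater means (y_r), the item means (y_i), and the grand
    mean y, which the text calls the study cutting score.
    """
    n_items = len(ratings[0])
    rater_means = [mean(row) for row in ratings]
    item_means = [mean(row[i] for row in ratings) for i in range(n_items)]
    # Every rater rates the same number of items, so the mean of the
    # rater means equals the mean over all raters and items.
    grand_mean = mean(rater_means)
    return rater_means, item_means, grand_mean


# Hypothetical ratings, reported in units of 1/10
ratings = [
    [0.9, 0.8, 0.7, 0.8],
    [0.8, 0.8, 0.6, 0.8],
    [0.7, 0.6, 0.5, 0.6],
]
rater_means, item_means, grand_mean = cutting_score_summary(ratings)
```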
[Text illegible in the source.]

For the purpose of examining variability in y, s(y_r) is relevant
but not actually the quantity of principal interest. Rather,
one would usually like to know how variable y would be if the study were
replicated (under similar conditions) a large number of times. Let us
describe this variability in y in terms of a standard deviation and
identify it as σ(y). Clearly, if σ(y) were small, then, even if raters
disagree to some extent concerning the cutting score resulting from a
single study, such disagreement would not seriously impact one's confidence
in using y as a cutting score. However, if σ(y) were large, then
one might want to keep this fact in mind when making decisions based
on y.
[...] the number of items involved in each replication. Equation 3.1
in Table 3.2 applies if an investigator assumes that each replicated
study would use t raters and m items, as in the original study;
Table 3.2 shows that σ(y) = .041 for the data in Table 3.1. If, however,
the number of items, n, is different from (usually smaller than) m, then
the appropriate estimate would be obtained from Equation 3.2 in Table 3.2.
For example, given the synthetic data and a test length of n = 10 items,
Table 3.2 shows that σ(y) = .045.

A third estimate of σ(y) is obtained by assuming that replicated
studies would each involve rating all items in the universe. Under
this circumstance, the appropriate estimate of σ(y) is Equation 3.3
in Table 3.2; and for the synthetic data σ(y) = .036. This value
is less than either of the other two estimates of σ(y) because σ(y)
decreases as the number of items increases.



Table 3.2

Equations and Illustrative Computations for Determining the Standard Deviation of a Mean Cutting Score

Equation                                        Computations Using Data in Table 3.1

Let t = number of raters used in study          t = 5
    m = number of items used in study           m = 20

Define A = [1/(m(t-1))] ×                       A = [1/((20)(4))] ×
  [(average value of s²(y_ri)) - s²(y_i)]         [(.0143 + .0122 + .0148 + .0115 + .0257)/5 - .0074]
                                                  = .0001

Standard deviation of y over different studies
using t raters and m items:

(3.1) σ(y) = √[s²(y_r)/t + s²(y_i)/m - A]       σ(y) = √[(.0069)/5 + (.0074)/20 - .0001] = .041

Standard deviation of y over different studies
using t raters and some number of items,
n, different from m:

(3.2) σ(y) = √[s²(y_r)/t + s²(y_i)/n - A]       If n = 10 items,
                                                σ(y) = √[(.0069)/5 + (.0074)/10 - .0001] = .045

Standard deviation of y if each of the
t raters rated all items in the universe:

Any one of these estimates might be of interest to an investigator;
however, the third estimate is especially relevant for many (if not most) 
domain-referenced testing situations. Recall that a cutting score is 
defined as a proportion of items correct for the universe of items.
It follows that ideally one would like to have each rater rate every
item in the universe to obtain each of the "rater cutting scores." It
is almost always impossible to obtain such data directly, but even so
Equation 3.3 allows us to estimate σ(y) under this circumstance. This
equation is also appropriate if the rating procedure is followed for all
items that occur in each and every form of a domain-referenced test.
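
The three estimates of σ(y) can be reproduced from the summary statistics quoted in Table 3.2. The Python sketch below is illustrative only. The function name and argument names are hypothetical, and the form used for Equation 3.3 (Equation 3.2 with the item term dropped, as the number of items grows without bound) is an assumption, since that equation does not appear in this excerpt; it does reproduce the reported value of .036.

```python
import math


def sigma_mean_cutting_score(var_rater_means, var_item_means,
                             within_rater_vars, t, m, n=None):
    """Standard deviation of the mean cutting score y over replicated studies.

    n == m   -> Equation 3.1
    other n  -> Equation 3.2
    n = None -> every rater rates all items in the universe
                (assumed form of Equation 3.3: the item term vanishes)
    """
    avg_within = sum(within_rater_vars) / len(within_rater_vars)
    a = (avg_within - var_item_means) / (m * (t - 1))  # the quantity A
    item_term = 0.0 if n is None else var_item_means / n
    return math.sqrt(var_rater_means / t + item_term - a)


# Summary statistics as quoted in Table 3.2
within = [0.0143, 0.0122, 0.0148, 0.0115, 0.0257]  # s^2(y_ri) for each rater
sigma_31 = sigma_mean_cutting_score(0.0069, 0.0074, within, t=5, m=20, n=20)
sigma_32 = sigma_mean_cutting_score(0.0069, 0.0074, within, t=5, m=20, n=10)
sigma_33 = sigma_mean_cutting_score(0.0069, 0.0074, within, t=5, m=20)

print(round(sigma_31, 3), round(sigma_32, 3), round(sigma_33, 3))
# prints 0.041 0.045 0.036
```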

One particular use of σ(y) in Equation 3.3 is in establishing a
confidence interval for the cutting score. For example, if one goes
one standard deviation to the right and left of y, then one obtains
a 68% confidence interval for the cutting score π_o. For the synthetic
data this interval extends from

     y - σ(y) = .800 - .036 = .76

to   y + σ(y) = .800 + .036 = .84,

and this interval is represented (.76, .84). In words, we can say
that if the cutting score study were replicated a large number of times
(each time using all items in the universe), about 68% of the time we
would expect to obtain values of y between .76 and .84.
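
As a small illustration of the interval just described, the Python sketch below computes y ± kσ(y) and, assuming the values of y over replications are approximately normally distributed (an assumption made here, not stated in the text), the proportion of replications expected to fall within k standard deviations. The function names are hypothetical.

```python
from statistics import NormalDist


def interval(y, sigma_y, k=1.0):
    """Interval extending k standard deviations to either side of y."""
    return (y - k * sigma_y, y + k * sigma_y)


def normal_coverage(k=1.0):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    standard_normal = NormalDist()
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


low, high = interval(0.800, 0.036)
print(round(low, 2), round(high, 2))   # prints 0.76 0.84
print(round(normal_coverage(1.0), 2))  # prints 0.68
```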

Given these data, therefore, in a certain statistical sense
y = .80 is the best single number (proportion of items correct) to use as
a cutting score, π_o; however, an investigator is well advised to entertain
some uncertainty about whether or not this value for π_o is "correct"
in some absolute sense. Also, as will be indicated in Section 4,
for some purposes, procedures are available that employ what is called
an "indifference zone" for the cutting score π_o; and the confidence
interval discussed above can be helpful in picking an indifference
zone.

Other Considerations 

One factor that can contribute greatly to differences among raters
in their y_r values is differing ideas about what constitutes minimal
performance. Any definition of minimal competence is almost always
a matter of judgment (packing a parachute may be an exception!), but
very disparate notions about minimal competence can render a cutting
score study of relatively little value. At the same time, however,
the raters themselves should be well qualified to define what minimal 
competence is, or at least to have a voice in any such definition. 
In particular, it is very difficult, if not impossible, for raters to 
participate in a cutting score study using someone else's definition 
of minimum competence. For these reasons, it is advised that raters 
have the opportunity to discuss their possibly different notions about 
minimal competence prior to conducting the actual study. Hopefully, 
they can reach some consensus or at least mitigate their differences of 
opinion in a mutually acceptable manner. 

Another issue to be considered is the manner in which items are
provided to raters: specifically, are the answers provided along with
the items? All things considered, it is probably best that answers
be supplied. In doing so, one can obtain an additional check on
the correctness of the indicated answers, and raters are probably
more likely to pay careful attention to each item individually. Assuming
that the answers are supplied, each rater should be directed to
indicate any items that he/she judges to be keyed incorrectly. If
it is determined after the raters complete their task that an item is
keyed incorrectly, it (and the probabilities assigned to it) should
be eliminated from the study, and the item should be revised or discarded.
If, on the other hand, it is determined after careful consideration
that a rater said an item was keyed incorrectly, but actually
it was keyed correctly, then that rater's judgment (i.e., assigned
probability) for that item should be eliminated in determining y.
This can happen; each individual rater is not infallible, even in
his/her area of expertise.

Table 3.1 illustrates the rather common occurrence of one rater 
(in this case Rater 5) providing judgments that are markedly different 
from the judgments provided by other raters. Even so (assuming all 
raters were chosen carefully in the first place), an atypical rater
should not be eliminated from the study unless there is an obvious
reason (e.g., sickness) for that rater's atypical judgments. If such
a reason exists, then all statistics should be re-calculated based on the
reduced set of raters. [For example, if Rater 5 were eliminated from
the synthetic data, then the reader can verify that y = .835; s(y_r) = .031;
and, using Equation 3.3, σ(y) = .021.]

One modification of (or addition to) this procedure for establishing
a cutting score involves having the raters, as a group, provide
a consensus probability for each item after they have independently
provided their judgments about each item. Then the mean of these consensus
probabilities is used as the cutting score. If this modification
is employed, the resulting data should be examined very carefully to ensure
that no single rater is exerting undue influence over the judgments
of other raters. (Also, if this modification is used, one should
keep in mind that forced consensus is not really agreement, although
forced consensus can effectively hide disagreement.)

4. Establishing an Advancement Score

When domain-referenced testing is employed to make mastery/non-mastery
types of decisions, it is necessary to consider a cutting score,
π_o; but, in addition, the investigator must specify an observable score,
x_o, such that an examinee who gets x_o or more items correct will be
declared a master, and an examinee who gets fewer than x_o items correct
will be declared a non-master. This score is called an advancement
score, with the symbol x_o referring to the advancement score in terms
of number of items correct and (later) the symbol c_o referring to the
advancement score in terms of proportion of items correct.

In principle, one wants to pass, or advance, an examinee if that
examinee's universe score, π_p, is equal to or greater than the cutting
score, π_o. However, one cannot directly use such a decision rule because
a specific domain-referenced test will consist of only a sample
of items from the universe. Based on any sample of items, an examinee's
observed mean score, x̄_p, can be calculated, but not the examinee's universe
score, π_p. Furthermore, the cutting score, π_o, may not correspond
with a possible observed mean score for a test of n items. (For example,
if n = 10, then no proportion of items correct will correspond with a
cutting score of .85.)
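
The point in the parenthetical example can be checked mechanically. The Python sketch below lists the proportions attainable on an n-item test and tests whether a given cutting score is among them; the function names are hypothetical, and exact fractions are used so that values such as .85 are not distorted by floating-point representation.

```python
from fractions import Fraction


def attainable_proportions(n):
    """All proportions of items correct that can be observed on an n-item test."""
    return [Fraction(x, n) for x in range(n + 1)]


def is_attainable(cutting_score, n):
    """True if some number of items correct out of n equals the cutting score.

    cutting_score is passed as a string (e.g. "0.85") so that it is
    represented exactly rather than as a binary floating-point value.
    """
    return Fraction(cutting_score) in attainable_proportions(n)


print(is_attainable("0.85", 10))  # False: .85 falls between 8/10 and 9/10
print(is_attainable("0.80", 10))  # True: 8/10
```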

Let us suppose that, as a result of some cutting score study, π_o
is specified to be .80, and let us assume that a test will consist
of n = 10 items. Since .80 × 10 = 8, an investigator might decide that
the advancement score should be:

     x_o = 8 in terms of number of items correct; or

     c_o = x_o/n = 8/10 = .80 in terms of proportion of items correct.



In this example, choosing x_o to be eight items correct may appear reasonable
and, indeed, this particular advancement score may be a good
choice in some particular context. However, the "logic" presented above
for choosing an advancement score is rather superficial. For example,
this logic does not take into account the fact that an observed score
may be, and usually is, different from a universe score. As will become
evident later, a more thorough analysis could lead to choosing
some advancement score other than x_o = 8.

The purpose of this section is to provide a reasonably sound,
yet relatively simple, table look-up procedure for choosing an advancement
score. Even though this procedure is quite simple compared
to others that might be used, it does involve consideration of several
technical issues. Specifically, to use this procedure, one must first
specify a test length, a loss ratio, and an indifference zone. These
issues are discussed below, followed by an illustration of how to use
the table look-up procedure.

Related Issues

Sometimes, choosing a test length (n) is a more difficult problem
than it may appear to be at first glance. All other things being equal,
longer tests are to be preferred over shorter tests, because longer
tests reduce certain types of errors (discussed more fully later).
Also, longer tests are more valid in the sense that they provide a more
thorough representation of the intended universe of items. At the same
time, however, in domain-referenced testing environments, factors such
as available testing time frequently make it very difficult and/or costly
to use tests that are very long. For now, it will be assumed that there
already exists some reasonable basis for choosing a particular test
length, at least for the initial form(s) of a domain-referenced test.
In subsequent sections, as different concepts and procedures are developed,
it will be possible to identify some reasonable statistics to
consider in choosing, or modifying, test length.

Classification errors and loss ratio. The concept of a loss ratio
involves a consideration of errors that can be made in classifying
an examinee as a passing examinee (master) or a failing examinee (non-master).
Specifically, there are two classification errors that can
be made:

(a) a false positive error occurs if an examinee is declared a master
(i.e., advanced) who has a universe score below π_o; and

(b) a false negative error occurs if an examinee is declared a non-master
(i.e., not advanced) who has a universe score above π_o.

These two classification errors are considered more fully in Section
5 in the context of decisions about individual examinees. Here, our
concern is with a certain kind of judgment about false positive and
false negative errors. Specifically, in this handbook the term "loss
ratio" refers to a number reflecting judgment about the seriousness
of a false positive error compared to the seriousness of a false negative
error. For example, if false positive errors were judged to be
twice as serious as false negative errors, then the loss ratio would be
two; and, if both types of classification errors were equally serious,
then the loss ratio would be one.

By definition, the specification of a loss ratio involves subjective
judgment on the part of a person (or persons) intimately familiar
with the testing context. In making this judgment one needs to
consider the consequences of inappropriately passing or inappropriately
failing an examinee. For example, in many domain-referenced testing
contexts, it is frequently argued that an examinee who is inappropriately
advanced (false positive error) is likely to be unsuccessful
on-the-job or in subsequent instruction; and this type of error is
judged more serious than the time and cost involved in inappropriately
re-cycling an examinee through an instructional sequence (false negative
error). These particular judgments suggest that a loss ratio, in
such contexts, should be defined as some number greater than one: perhaps
two, but probably not three unless instructional time and cost are quite
unimportant.

Indifference zone. An indifference zone is some range of universe
scores within which one is "indifferent" about false positive and false
negative errors. Let us identify the lower limit of this range as π_L,
the upper limit as π_H, and the range itself as (π_L, π_H). Suppose
an investigator is able to specify values for π_L and π_H such that,
for any examinee whose universe score is between π_L and π_H, there is
virtually no loss involved in declaring a true master to be a non-master
or in declaring a true non-master to be a master. In such a case the
interval (π_L, π_H) may be viewed as an indifference zone. This rather
direct approach to defining an indifference zone may or may not make
sense in a particular context.

Another approach involves the procedure for establishing a cutting
score discussed in Section 3. Specifically, consider again σ(y) in
Equation 3.3, which is the standard deviation of y over replicated
studies, if each study involved all the items in the universe. It was
stated in Section 3 that y can serve as π_o and a 68% confidence interval
for π_o can be viewed as extending from y - σ(y) to y + σ(y), approximately.
This confidence interval (or something close to it) might
be viewed as an indifference zone. Consider, for example, the synthetic
data treated in Section 3. For these data, y = .80; using Equation
3.3, σ(y) = .036; and the 68% confidence interval is (.76 to .84).
Since this interval indicates a degree of uncertainty about some "ideal"
value for a cutting score, it seems reasonable to assume that an investigator
might have little basis for being anything but indifferent about
classification errors for examinees whose universe scores lie in the
interval (.76 to .84).

In considering either of the above approaches to establishing an 
indifference zone, it needs to be recognized that these procedures 
are not to be viewed as statistical excuses for being indifferent, in 
the sense of uncaring, about individual examinees who have observed mean
scores close to π_o. Rather, these procedures are to be viewed as aids
in the process of establishing an indifference zone, which is a necessary
consideration for picking an advancement score using the table
discussed below. 

Advancement Score Table 

Given a test length, a loss ratio, and an indifference zone, Table
A.1 provides a specific advancement score, x_o, in terms of number of
items correct. (To obtain the advancement score in terms of proportion
of items correct, one simply uses the relationship c_o = x_o/n.) The rows
of Table A.1 are associated with different test lengths, ranging from
6 to 30 items; and the columns are associated with 20 indifference zones,
organized according to the mid-points of the zones, with mid-points
ranging from .65 to .90. For each row and column, there are three
tabled entries (separated by slashes) corresponding to advancement
scores associated with loss ratios of 1, 2, and 3, respectively.

To illustrate use of Table A.1, let us consider the following
judgments about test length, loss ratio, and indifference zone:

(a) Test length. Let us assume that testing time is at a premium,
and the universe of items is rather narrow. Taking these two considerations
into account, it is judged that about n = 10 test items seems
reasonable.

(b) Loss ratio. Let us assume that the domain-referenced testing
context is one in which false positive errors are judged to be somewhat
more serious than false negative errors, and a loss ratio of about two
seems reasonable.

(c) Indifference zone. Let us suppose that it is decided to use
the results of a cutting score study in making judgments about an indifference
zone. Specifically, let us suppose that the results reported
in Section 3 are based on the appropriate universe of items. This study
suggests that an approximate 68% confidence interval for π_o is (.76 to .84),
and it will be assumed that this confidence interval can serve as an
approximate indifference zone.

Now, given the above judgments, to pick an advancement score,
one uses the fifth row (n = 10) and second column (.75 to .85) of the
second page of Table A.1. The tabled entries corresponding to this
row and column are 9/9/9. Since all of these entries are the same
number, it is obvious that the advancement score is x_o = 9 or c_o = 9/10
= .90. To be specific, since the loss ratio has been defined as two,
the second entry is actually the advancement score for this illustration.

In the above example, note that the indifference zone (.75 to .85)
specified in the second column of the second page of Table A.1 is not
exactly equal to the indifference zone of (.76 to .84), which was initially
chosen. Any such slight disparity can be overlooked without
serious consequences, because, for the most part, the procedure used
to develop Table A.1 is insensitive to small disparities in indifference
zones. Furthermore, it is not necessary that π_o be exactly at the
midpoint of the indifference zone. Indeed, for reasons beyond the
scope of this handbook, it is sufficient that π_o be somewhere within
the indifference zone.

Table A.1 indicates (and the above example illustrates) that this
procedure for choosing an advancement score is also relatively insensitive
to small changes in loss ratio. Indeed, for any specific test
length and indifference zone in Table A.1, the suggested advancement
scores differ by at most one correct item.

The above points about "insensitivity" have been made to highlight
the fact that this procedure for choosing an advancement score does not
necessitate arguing about minute differences of opinion with respect to
an appropriate indifference zone or loss ratio; a reasoned consideration
of these issues is sufficient for the procedure.

5. Errors of Measurement,
Errors of Classification, and
Inferences about an Examinee's Universe Score

Sections 2, 3, and 4 have considered issues that are addressed prior
to making any decision about an examinee. Let us now assume that the
issues discussed in Sections 2, 3, and 4 have been addressed, a domain-referenced
test of n items has been administered to a group of examinees,
and each examinee's score on the test has been determined. In this section
consideration is given to the precision, or quality, of certain statements,
or decisions, that might be made about an examinee. To address these
issues, the only examinee datum that will be employed is the examinee's
test score. To simplify notation in this section, usually the examinee's
number of items correct will be denoted x, the examinee's proportion of
items correct will be denoted x̄ (rather than x̄_p), and the examinee's
universe score will be denoted π (rather than π_p).

It cannot be emphasized enough that π is always unknown, and x̄ is
only an estimate of it. Consequently, there is always some degree of
uncertainty about any statement concerning π. For example, if x̄ = .80,
one may say that π is "about" .80, but this statement clearly suggests
that π and x̄ may be different, and perhaps dramatically different.
This difference between x̄ and π is called an error of measurement.

Furthermore, since x̄ is an imperfect estimate of π, mastery/non-mastery
decisions based on x (or x̄) may be incorrect, and an error of
classification may be made. This issue was introduced in the previous
section in the context of specifying a loss ratio. In this section,
errors of classification are considered in more detail, from the perspective
of decisions about examinees.

It needs to be recognized that, since π is unknown, one cannot
specify whether or not a classification error has been made for an
individual examinee; nor can one specify a particular value for an
individual examinee's error of measurement. However, given n and x
(or x̄), it is possible to make statements about the probability of
correct and incorrect decisions, and about likely values of π. Procedures
for doing so are described and illustrated in this section,
after a more detailed consideration of errors of measurement and classification.

Errors of Measurement and Classification

Recall that an examinee's universe score is the proportion of items,
π, that the examinee would get correct if the examinee were administered
all items in the universe. Suppose an examinee takes a domain-referenced
test with n = 10 items and gets x = 8 items correct. It should be intuitively
obvious that this does not necessarily mean that the examinee's
universe score is x̄ = x/n = 8/10 = .80. After all, the examinee was tested
with 10 items only; and it is to be expected that x̄ = .80 is an imperfect
estimate of the examinee's universe score. This imperfection in measurement
is called measurement error. Specifically, measurement error is the
difference between an examinee's test score (expressed as a proportion of
items correct, x̄) and the examinee's universe score:

     Δ = x̄ - π.

Note the use of the symbol Δ to designate measurement error. Clearly,
Δ can be either positive or negative, as well as being either large or
small.

It is evident from the definition of Δ that a cutting score, π_o,
plays no role in considerations regarding error of measurement. However,
for mastery/non-mastery decisions a cutting score, π_o, is involved; and for
such decisions, an error of classification may be made in addition to an
error of measurement. As noted in Section 4, there are two types of errors
of classification:

(a) a false positive error (f+) occurs if an examinee is declared a
master (x ≥ x_o) when the examinee's universe score is below π_o; and

(b) a false negative error (f-) occurs if an examinee is declared
a non-master (x < x_o) when the examinee's universe score is at or above π_o.

These two possible errors of classification are represented in Table 5.1
along with the two possible correct decisions: namely, passing an examinee
who has a universe score at or above π_o (c+), and failing an examinee who
has a universe score below π_o (c-).

o 

To better appreciate errors of measurement and classification, consider
Figure 5.1 in which it is assumed that π_o = .80, n = 10, and c_o = .90. For
12 pairs of values for x̄ and π, Figure 5.1 represents the resulting error
of measurement and error of classification or correct decision. As illustrated
in Figure 5.1:

(a) a false positive decision implies that a positive error of measurement
(x̄ > π) is involved (see lines G, H, and I in Figure 5.1);

(b) a false negative decision implies that a negative error of measurement
(x̄ < π) is involved (see lines J, K, and L in Figure 5.1); and

(c) even when a correct (positive or negative) decision is made, an
error of measurement (positive or negative) may be involved (see lines A-F
in Figure 5.1).

In short, the occurrence of an error of measurement does not neces- 
sarily mean that an error of classification will be made; however, an error 
of classification is always associated with an error of measurement, and 
Table 5.1

Correct Mastery/Non-Mastery Decisions and
Errors of Classification

                              Universe Score
Observed Score        π < π_o                π ≥ π_o
-----------------------------------------------------------------
x < x_o (Fail)        Correct Negative       False Negative
                      Decision (c-)          Error (f-)

x ≥ x_o (Pass)        False Positive         Correct Positive
                      Error (f+)             Decision (c+)
-----------------------------------------------------------------

Note. The symbol > means "greater than," the symbol ≥ means
"greater than or equal to," the symbol < means "less than," and
the symbol ≤ means "less than or equal to."
Figure 5.1. Illustration of Errors of Classification and Errors of Measurement.
[...] large or small; such an error is either made or it is not made, nothing
more. For example, lines G and I in Figure 5.1 both represent false positive
classification errors, and line G does not represent a larger classification
error than line I. Rather, line G represents a larger error of measurement
than line I.

It needs to be recognized that, since an individual examinee's universe
score is unknown, we cannot directly determine the error of measurement
for an individual examinee. For the same reason, it is impossible
to say, for certain, whether or not a classification error has been made
for an individual examinee. However, given n and x (or x̄) it is possible
to make statements about: (a) probabilities associated with correct and
incorrect decisions; and (b) likely values for π. Procedures for doing so
are treated in the next two parts of this section.

Probabilities of Correct and Incorrect Decisions 

Since one cannot say, for certain, whether or not a classification
error has been made for an individual examinee, it is reasonable to ask,
"How probable is it that an examinee with a score of x (or x̄) on an n-item
[...] answering this question involves using Table A.2, which was [...]
these assumptions imply that all we know about an examinee is the
examinee's test score, and the fact that the examinee took a test consisting
of a sample of n items from a large universe of items.

Table 5.2 provides a step-by-step procedure, with examples, for determining
probabilities associated with correct and incorrect decisions.
This procedure involves nothing more complicated than identifying an entry
in Table A.2 and possibly subtracting it from 100. Note that, in this
handbook, a probability is usually identified and discussed as a percent
ranging from 0 to 100. This convention has been adopted to avoid confusing
a statement about a probability with a statement about an examinee's
universe score (π) or observed mean score (x̄), both of which range from
0 to 1.

It is suggested that, whenever mastery/non-mastery decisions are to
be made, the investigator examine the probabilities in Table 5.2, at least
the probabilities of incorrect decisions for examinees near the cutting
score. For example, using the procedure in Table 5.2 with n = 10,
π_o = .80, and c_o = .90,

     if x = 6,        Prob (f-) = 5%
     if x = 7,        Prob (f-) = 16%
     if x = 8,        Prob (f-) = 38%
     if x = 9, and    Prob (f+) = 32%
     if x = 10,       Prob (f+) = 9%
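
The five percentages above can be reproduced without Table A.2 under one assumption that the surviving text does not state explicitly: that the tabled entries are posterior probabilities for π under a uniform prior, so that the posterior is Beta(x + 1, n - x + 1). The sketch below uses that assumption; the function names are hypothetical, and the upper-tail probability is computed through the standard identity between the beta distribution and a binomial sum.

```python
from math import comb


def prob_universe_at_least(pi_o, x, n):
    """P(pi >= pi_o | x correct out of n), ASSUMING a uniform prior on pi.

    The posterior is then Beta(x + 1, n - x + 1); its upper tail at pi_o
    equals the probability that a Binomial(n + 1, pi_o) count is <= x.
    """
    return sum(comb(n + 1, j) * pi_o**j * (1 - pi_o) ** (n + 1 - j)
               for j in range(x + 1))


def decision_error_percent(x, n, pi_o, x_o):
    """Return the error type for this decision and its probability in percent."""
    p_master = prob_universe_at_least(pi_o, x, n)
    if x >= x_o:
        # Examinee is passed; the decision is wrong if pi is below pi_o.
        return "f+", round(100 * (1 - p_master))
    # Examinee is failed; the decision is wrong if pi is at or above pi_o.
    return "f-", round(100 * p_master)


for x in range(6, 11):
    print(x, decision_error_percent(x, n=10, pi_o=0.80, x_o=9))
```

That the output agrees with the quoted values supports the uniform-prior reading, but it remains an inference from the numbers rather than something the handbook text confirms here.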
Example 1: Prob (c-) = (100 - 16)% = 84%

Example 1: Prob (f-) = 16%

Probability of a Correct Positive Decision:
(5.3) Prob (c+) = TE%

Example 2: Prob (c+) = 68%

Probability of a False Positive Decision:
(5.4) Prob (f+) = (100 - TE)%

Example 2: Prob (f+) = (100 - 68)% = 32%
*[...] the probability of [...].

The probabilities of correct and incorrect decisions resulting from the
procedure outlined in Table 5.2 do not depend on having examinee scores
[...]; rather, these probabilities are for any test consisting of a
sample of 10 items from a very large universe. It follows that
an investigator might consider making a decision about test length based
on an examination of probabilities of incorrect decisions, for tests of
different length. In Section 6 a closely related issue is treated in
Table 5.3

Use of Table A.2 to Make Statements about Likely Values for π

Procedure and Equations

  Probability that π is Between π_1 and π_2
  Given n, x, π_1, and π_2:

    (a) using the left-hand side of Table A.2,
        locate the row for n and x;

    (b) let TE_1 be the tabled entry in this
        row under the column headed π_1; and

    (c) let TE_2 be the tabled entry in this
        row under the column headed π_2.

  (5.5) Prob (π_1 < π < π_2) = (TE_1 - TE_2)%

Examples

  Suppose n = 10, π_1 = .75, and π_2 = .85

  Example 1: If x = 7, then
  TE_1 = 29, TE_2 = 7, and
  Prob (.75 < π < .85) = (29 - 7)% = 22%

  Example 2: If x = 9, then
  TE_1 = 80, TE_2 = 51, and
  Prob (.75 < π < .85) = (80 - 51)% = 29%

Procedure and Equations

  P% Credibility Interval for π
  Given n, x, and P:

    (a) locate the row for n and x in the
        right-hand side of Table A.2; and

    (b) let (π_L, π_U) be the tabled entry
        in this row under the column
        headed P-Percent.

  (5.6) A P% Credibility Interval for π = (π_L, π_U)
        (i.e., there is a P% probability that
        π is between π_L and π_U)

Examples

  Suppose n = 10 and P% = 80%

  Example 1: If x = 7,
  A P% Credibility Interval for π = (.51, .85)

  Example 2: If x = 9,
  A P% Credibility Interval for π = (.74, .98)
By contrast, (b) answers the question:

"Given n, x, and some desired degree of certainty (P%), what
is a range of values which probably includes π?"
For example, given n = 10 and x = 8, Table A.2 reports that:

(1) with 67% certainty π is between .67 and .90;

(2) with 80% certainty π is between .62 and .92; and

(3) with 90% certainty π is between .56 and .94.

Note that if one wants to have a greater degree of certainty about the
range within which an examinee's universe score probably lies, then one
must tolerate a wider interval. For example, the interval (.56, .94) for
90% certainty is quite a bit wider than the interval (.67, .90) for 67%
certainty.

Also, given x̄ and some desired degree of certainty, the width of an
interval decreases as n increases. For example, given n = 20 and x = 16,
x̄ = .80 and from Table A.2 a 67% interval is (.71, .87). This interval
is shorter than the corresponding interval (.67, .90) for n = 10 and x = 8.
In this sense one can say that long tests are better than short tests, or,
more specifically, longer tests are generally associated with a smaller
average error of measurement for examinees. This issue of test length
and its relationship with errors of measurement is treated in detail in
Section 6.
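
As a rough check on the intervals quoted for n = 10 and x = 8, the sketch below computes how much posterior probability each interval contains. It assumes, as an interpretation not stated in the surviving text, that Table A.2 is based on a uniform prior for π, giving a Beta(x + 1, n - x + 1) posterior. The function names are hypothetical.

```python
from math import comb


def posterior_cdf(p, x, n):
    """P(pi <= p | x correct out of n) for a Beta(x + 1, n - x + 1) posterior.

    ASSUMPTION: uniform prior on pi. The beta CDF is evaluated through its
    identity with the upper tail of a Binomial(n + 1, p) distribution.
    """
    return sum(comb(n + 1, j) * p**j * (1 - p) ** (n + 1 - j)
               for j in range(x + 1, n + 2))


def interval_percent(lower, upper, x, n):
    """Percent of posterior probability lying between lower and upper."""
    return 100 * (posterior_cdf(upper, x, n) - posterior_cdf(lower, x, n))


for lower, upper in [(0.67, 0.90), (0.62, 0.92), (0.56, 0.94)]:
    print((lower, upper), round(interval_percent(lower, upper, x=8, n=10), 1))
```

Under this assumption the three intervals contain close to 67, 80, and 90 percent of the posterior probability, respectively.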

The intervals reported in Table A.2 are sometimes described as credibility
intervals. Specifically, Table A.2 reports 67, 80, and 90 percent
credibility intervals associated with observed mean scores of x̄ > .50,
for test lengths ranging from 5 to 30 items. Similar results can be
obtained for other intervals, other test lengths, and/or other observed mean
scores using the procedure outlined in Table 5.4. Actually, an interval
obtained using the procedure in Table 5.4 is called a confidence interval
rather than a credibility interval, and the interpretation of a confidence
interval is slightly different from the interpretation of a credibility
interval. However, for most practical purposes they can be interpreted
in about the same way.

As indicated by the example in Table 5.4, one can say with about
68 percent confidence that an examinee with an observed mean score of
.75 on a 20-item test probably has a universe score between .65 and .85.
By comparison, consider the "corresponding" 67% credibility interval provided
in Table A.2. This credibility interval extends from .65 to .83. Clearly, the
two intervals are quite close, but not exactly the same. In general, it
is recommended that the credibility intervals in Table A.2 be used whenever
possible, and that the procedure in Table 5.4 be used when Table A.2
does not apply. For example, Table A.2 does not provide 95 percent intervals,
but the procedure in Table 5.4 can be used to obtain such intervals.
(Note, however, that the procedure in Table 5.4 does not apply if
x̄_p = 0 or 1; and this procedure involves a normality assumption that
becomes less tenable as x̄_p approaches either 0 or 1.)
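
The two-step procedure of Table 5.4 is short enough to sketch directly. The function name and dictionary below are hypothetical; the z values are those listed in Table 5.4 as they survive in this copy, and the standard error is the square root of x̄_p(1 - x̄_p)/(n - 1).

```python
from math import sqrt

# z values as listed in Table 5.4, keyed by percent confidence.
Z_VALUES = {68: 1.00, 75: 1.15, 80: 1.29, 90: 1.65, 95: 1.96}


def confidence_interval(x_bar, n, percent=68):
    """Normal-approximation interval for an examinee's universe score.

    x_bar   : observed mean score (proportion of items correct)
    n       : number of items on the test
    percent : one of the confidence levels in Z_VALUES
    Returns (lower, upper, sigma), where sigma is the standard error.
    """
    if x_bar <= 0 or x_bar >= 1:
        # The procedure does not apply when x_bar is exactly 0 or 1.
        raise ValueError("procedure does not apply when x_bar is 0 or 1")
    sigma = sqrt(x_bar * (1 - x_bar) / (n - 1))   # Equation 5.7
    z = Z_VALUES[percent]
    return x_bar - z * sigma, x_bar + z * sigma, sigma


lower, upper, sigma = confidence_interval(0.75, 20, 68)
print(round(sigma, 2), round(lower, 2), round(upper, 2))
```

For x̄_p = .75 and n = 20 this gives a standard error of about .10 and a 68 percent interval of roughly .65 to .85, matching the example discussed above.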

In this author's opinion, in domain-referenced testing, it is usually
advisable to determine credibility or confidence intervals for examinee
Table 5.4

Equations and Illustrative Computations for Obtaining Confidence Intervals for
an Examinee's Universe Score

Equations and Procedure

  Let n   = number of items in test
      x̄_p = examinee's observed mean score

  Step 1: Calculate

  (5.7) σ(Δ_p) = sqrt[ x̄_p(1 - x̄_p) / (n - 1) ]

  Step 2: A p percent confidence interval for
  the examinee's universe score extends from

  (5.8) x̄_p - z σ(Δ_p)  to  x̄_p + z σ(Δ_p)

  where z = 1.00 if p = 68 (percent)
        z = 1.15 if p = 75 (percent)
        z = 1.29 if p = 80 (percent)
        z = 1.65 if p = 90 (percent)
        z = 1.96 if p = 95 (percent)

Illustrative Computations

  Suppose n = 20,
          x̄_p = .75 (i.e., x = 15 items correct)

  Step 1: σ(Δ_p) = sqrt[ .75(1 - .75) / (20 - 1) ] = sqrt(.0099) = .10

  Step 2: 68 percent confidence interval extends from
          .75 - 1.00(.10) to .75 + 1.00(.10), i.e., from .65 to .85

          95 percent confidence interval extends from
          .75 - 1.96(.10) to .75 + 1.96(.10), i.e., from .55 to .95

Note. In Figure 1.1, z = 2 is used as an approximation to z = 1.96 when p = 95%.
universe scores, at least for those examinees about whom important decisions
are to be made. If nothing else, such intervals are usually very revealing
indicators of the amount of measurement error possibly involved in
using x̄ as if it were π. If an investigator feels that a specific interval
is too broad for a specific decision, then the investigator might consider
retesting the examinee.

Suppose, for example, that an examinee got 8 out of 10 items correct,
initially, with a 67% credibility interval for π extending from .67 to .90.
If the examinee were retested and got 10 out of 10 items correct, then for
the combined tests n = 20, x = 18, and a 67% credibility interval extends
from .82 to .95. This latter interval is considerably narrower than the
former one; and, of course, the additional information supplied by the
retest suggests that the examinee's universe score is probably higher
than originally expected.
6. Group-Based Coefficients of Agreement and
Measures of Error

Section 5 considered errors of measurement and errors of classification
based on an individual examinee's score on a test. This section
considers issues involving group performance on a test. Specifically,
the principal statistics to be discussed are indicated in Table 6.1.

The statistics 1 - p_o and σ²(Δ) in Table 6.1 are closely related
to errors of classification and errors of measurement, respectively.
Specifically, 1 - p_o can be interpreted as the probability of an
inconsistent decision; and σ²(Δ) can be interpreted as the average value of
the squared errors of measurement for examinees. As such, these statistics
provide information about errors for a group of examinees, as opposed
to an individual examinee.

The other statistics in Table 6.1 are called agreement coefficients
in this handbook. Each of them has a value somewhere between 0 and 1,
with higher values indicating greater degrees of agreement than lower
values. The notion of "agreement" reflected by these coefficients
involves considering what would happen (hypothetically) if examinees were
administered many domain-referenced tests, with each test consisting of
a different sample of n items from the universe. For a given test length
(n), a high value for an agreement coefficient suggests that there would
be a high degree of consistency in certain scores on these different
tests. For example, if we knew that most persons classified as masters
on one test would be classified as masters on most other tests, too,
then one type of agreement would be relatively high. Although the above
conceptual explanation of agreement coefficients rests on considering





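Table 6.1

Loss Functions, Agreement Coefficients, and Errors
Based on Group Performance on a Test

                        Agreement Coefficients
Type of            Not Corrected      Corrected
Loss               For Chance         For Chance        Errors
----------------------------------------------------------------
Threshold          p_o                Kappa             1 - p_o

Squared Error      Φ(c_o)             Φ                 σ²(Δ)
----------------------------------------------------------------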
multiple tests, in practice these coefficients can be estimated using a
single test only; and in this handbook such single-test estimates are
the only ones given detailed consideration.

The statistics in Table 6.1 can be classified into two categories
based on the type of loss function involved in defining them. These
two loss functions are called "threshold" loss and "squared error" loss.
The subject of loss functions, per se, is a highly technical consideration
that will not be treated in great detail here. For present purposes,
it is sufficient to know that (a) a threshold loss function
involves consideration of errors of classification, assumes that all false
positive errors are equally serious, and assumes that all false negative
errors are equally serious; and (b) a squared error loss function in
domain-referenced testing involves consideration of errors of measurement
and assumes that the seriousness of an error depends on (among other
things) the squared distance between an examinee's observed and universe
scores. Later, more will be said about these two loss functions; for now
the reader should simply recognize that these two loss functions involve
different approaches to addressing similar types of issues.

To develop some further understanding of the statistics in Table 6.1,
suppose that test scores were available for a group of examinees on two
forms of a domain-referenced test. Under this circumstance, the threshold
loss coefficient denoted p_o in Table 6.1 would be

     p_o = (Proportion of examinees classified as masters on both forms)
           + (Proportion of examinees classified as non-masters on both forms).

The coefficient p_o is, in effect, the proportion of examinees consistently
classified into the same category (mastery or non-mastery) on the two tests.
It follows from the above paragraph that 1 - p_o is the proportion
of examinees who are inconsistently classified on the two tests (i.e.,
classified as a master on one form and a non-master on the other). This
proportion of inconsistent classifications is a group-based measure of
error in a threshold loss sense, when scores on two tests are available.

The threshold loss coefficient p_o is not corrected for the expected
"chance" agreement if all examinees were randomly assigned to a mastery
or non-mastery status on each of the forms. The threshold loss coefficient
corrected for such chance agreement is called Kappa, which is defined as:

     Kappa = (p_o - p_c)/(1 - p_c),

where p_c is chance agreement. In a sense, Kappa is a "pure" measure of
agreement attributable to the testing procedure, under threshold loss
assumptions.
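
The two-form analogy can be made concrete with a small sketch. The classification lists below are made-up illustrative data, the function name is hypothetical, and chance agreement p_c is computed here from the marginal proportions of masters on each form, which is the usual convention for Kappa but is an assumption as far as this passage is concerned.

```python
def threshold_agreement(form1, form2):
    """Return (p_o, p_c, kappa) for two lists of mastery classifications.

    form1, form2 : lists of booleans, True meaning "classified as master",
                   with one entry per examinee in the same order.
    """
    k = len(form1)
    both_master = sum(a and b for a, b in zip(form1, form2)) / k
    both_non = sum((not a) and (not b) for a, b in zip(form1, form2)) / k
    p_o = both_master + both_non          # proportion consistently classified

    # ASSUMPTION: chance agreement is the agreement expected if the two
    # classifications were independent with the observed marginal rates.
    m1 = sum(form1) / k
    m2 = sum(form2) / k
    p_c = m1 * m2 + (1 - m1) * (1 - m2)

    kappa = (p_o - p_c) / (1 - p_c)
    return p_o, p_c, kappa


# Hypothetical classifications for ten examinees on two forms.
form_a = [True, True, True, True, True, True, False, False, False, False]
form_b = [True, True, True, True, False, False, True, False, False, False]
print(threshold_agreement(form_a, form_b))
```

In this made-up example seven of ten examinees are classified consistently, so p_o is .7 while Kappa is only .4, which illustrates how much of the raw agreement can be attributed to chance.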
The reader needs to be cautioned not to take the above "two-test"
analogy too literally. It is offered simply as an aid in thinking about
these statistics. Again, in this section the procedures treated involve
a single administration of a single form of a domain-referenced test.

As noted in Table 6.1, corresponding to each of these three threshold
loss statistics there is a statistic for squared error loss. For example,
σ²(Δ) is the average squared error of measurement for the population of
examinees, and the two agreement coefficients for squared error loss
involve σ²(Δ). These squared error loss statistics provide a different
perspective on agreement (and disagreement).

Throughout this section all reference to a cutting score, π_o, is
replaced by consideration of c_o = x_o/n, the advancement score in terms of
proportion of items correct. That is, in considering both squared error
loss and threshold loss, c_o is sometimes used when it might be argued that
π_o should be involved. To involve π_o, however, would necessitate
considerable complexities, no matter what loss function is involved.

Finally, it should be noted that some persons refer to the agreement
coefficients discussed in this section as "reliability" coefficients. The
word "reliability" is not used here principally to avoid unwarranted
associations between the coefficients in Table 6.1 and classical reliability
coefficients for norm-referenced tests. Given this caveat, however, much
of this section treats issues traditionally associated with measurement
consistency, or "reliability" considerations. (Also, in a sense mentioned
later, these issues have validity connotations for domain-referenced
interpretations.)

Squared Error Loss

Squared error loss statistics are conceptually more involved than
their threshold loss counterparts. Here, however, initial consideration is
given to squared error loss statistics because there are certain
computational conveniences in proceeding in this order.

Suppose that an n = 10 item test were administered to k = 25 examinees;
and suppose that after the items were scored, the resulting data
matrix was that given in Table 6.2. An entry in this data matrix is denoted
x_pi, the score (0 = incorrect, 1 = correct) for examinee p on item i.

Table 6.2

Group Performance on a Test:
A Synthetic Data Set with Sample Statistics

                              Item
Person     1    2    3    4    5    6    7    8    9   10     x̄_p
-------------------------------------------------------------------
   1       1    1    1    1    1    1    1    1    1    1     1.0
   2       1    1    1    1    1    1    1    1    1    1     1.0
   3       1    1    1    1    1    1    1    1    1    1     1.0
   4       1    1    1    1    1    1    1    1    1    1     1.0
   5       1    1    1    1    1    1    1    1    1    1     1.0
   6       1    1    1    1    1    1    1    1    1    1     1.0
   7       1    1    1    1    1    1    1    1    1    1     1.0
   8       1    1    0    1    1    1    1    1    1    1      .9
   9       1    1    1    1    1    1    1    0    1    1      .9
  10       1    1    1    0    1    1    1    1    1    1      .9
  11       1    1    1    0    1    1    1    1    1    1      .9
  12       1    1    1    1    1    1    1    1    1    0      .9
  13       1    1    1    1    1    1    1    1    1    0      .9
  14       1    0    1    1    1    1    0    1    1    1      .8
  15       1    1    1    1    0    1    1    1    0    1      .8
  16       1    1    1    1    0    1    1    0    1    1      .8
  17       1    1    1    1    1    1    1    0    0    1      .8
  18       1    1    1    1    1    1    1    1    0    0      .8
  19       1    1    1    1    1    1    1    0    0    1      .8
  20       0    0    1    1    1    0    1    1    1    1      .7
  21       1    0    1    0    0    1    0    1    1    1      .6
  22       1    0    1    1    1    1    0    1    0    0      .6
  23       0    1    1    1    0    0    1    1    0    1      .6
  24       1    0    1    1    1    1    0    0    0    0      .5
  25       0    0    1    1    1    0    0    1    0    0      .4
-------------------------------------------------------------------
 x̄_i     .88  .76  .96  .88  .84  .88  .80  .80  .68  .76   x̄ = .824

s²(x̄_i) = .0058          s(x̄_i) = .076
s²(x̄_p) = .0282          s(x̄_p) = .168
Other statistics reported in Table 6.2 are as follows:

(a) x̄_p is the proportion of items that examinee p got correct;

(b) s²(x̄_p) and s(x̄_p) are the variance and standard deviation,
respectively, of the scores x̄_p;

(c) x̄_i is the proportion of persons who got item i correct, i.e.,
the item difficulty level discussed in Section 2;

(d) s²(x̄_i) and s(x̄_i) are the variance and standard deviation,
respectively, of the item difficulty levels; and

(e) x̄ is the mean proportion of items correct for persons, or,
equivalently, the mean difficulty level for items.
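
The sample statistics in (a) through (e) can be computed from the 25 by 10 data matrix as sketched below. The rows are a reading of Table 6.2 in which a few illegible cells were filled in so that every row and column mean matches the printed margins; the variable names are hypothetical. The variances use a divisor of k (or n), not k - 1, since that is what reproduces the printed values of .0282 and .0058.

```python
# Each string is one examinee's responses to items 1..10 (1 = correct).
ROWS = (
    ["1111111111"] * 7 +                      # persons 1-7
    ["1101111111", "1111111011", "1110111111",
     "1110111111", "1111111110", "1111111110",   # persons 8-13
     "1011110111", "1111011101", "1111011011",
     "1111111001", "1111111100", "1111111001",   # persons 14-19
     "0011101111", "1010010111", "1011110100",
     "0111001101", "1011110000", "0011100100"]   # persons 20-25
)
data = [[int(ch) for ch in row] for row in ROWS]
k, n = len(data), len(data[0])


def variance(values):
    """Variance with divisor len(values), matching the table's statistics."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


person_means = [sum(row) / n for row in data]                  # x-bar_p
item_means = [sum(row[i] for row in data) / k for i in range(n)]  # x-bar_i
grand_mean = sum(person_means) / k                             # x-bar
var_persons = variance(person_means)                           # s^2(x-bar_p)
var_items = variance(item_means)                               # s^2(x-bar_i)

print(round(grand_mean, 3), round(var_persons, 4), round(var_items, 4))
```

The printed values should agree with the grand mean and the two variances reported at the foot of Table 6.2.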

Using these sample statistics, Table 6.3 provides formulas, with
illustrative computations, for estimating agreement coefficients and
other quantities of interest involving squared error loss. (These
formulas are used here because they are as computationally simple to
use as any that can be derived; however, other more computationally
difficult formulas would be better in terms of revealing certain
underlying theoretical issues.)

Universe score variance. It has been emphasized repeatedly in
previous sections that an examinee's observed score, x̄_p, is not
necessarily equal to his/her universe score, π_p. It follows that the
variance of examinees' observed scores, s²(x̄_p), is not necessarily equal
to the variance of examinees' universe scores, σ²(π_p), which is
abbreviated σ²(π) in Table 6.3. Actually, σ²(π) is almost always less than
Table 6.3

[...] Squared Error Loss

Formulas                                   Computations Using Data in Table 6.2
--------------------------------------------------------------------------------
Let k = number of examinees                k = 25
    n = number of items                    n = 10
    c_o = x_o/n = advancement score in     c_o = 9/10 = .9
          terms of proportion of items
          correct

Universe Score Variance

(6.1) σ²(π) = k[n s²(x̄_p) + s²(x̄_i) - x̄(1 - x̄)] / [(n - 1)(k - 1)]

      σ²(π) = 25[10(.0282) + .0058 - .824(1 - .824)] / [(9)(24)] = .0165
      [σ(π) = sqrt(.0165) = .129]

Error Variance

(6.2) σ²(Δ) = [x̄(1 - x̄) - s²(x̄_p)] / (n - 1)

      σ²(Δ) = [.824(1 - .824) - .0282] / (10 - 1) = .0130
      [σ(Δ) = sqrt(.0130) = .114]

Agreement Coefficient Not Corrected for Chance

(6.3) Φ(c_o) = [s²(x̄_p) + (x̄ - c_o)² - σ²(Δ)] / [s²(x̄_p) + (x̄ - c_o)²]

      Φ(.9) = [.0282 + (.824 - .9)² - .0130] / [.0282 + (.824 - .9)²] = .617

KR-21

(6.4) KR-21 = [s²(x̄_p) - σ²(Δ)] / s²(x̄_p)

      KR-21 = (.0282 - .0130) / .0282 = .539

Agreement Coefficient Corrected for Chance

(6.5) Φ = σ²(π) / [σ²(π) + σ²(Δ)]

      Φ = .0165 / (.0165 + .0130) = .559
the observed score variance. This fact is not immediately evident from
Equation 6.1 in Table 6.3; but the computation section of Table 6.3
shows that σ²(π) = .0165, a value considerably smaller than s²(x̄_p) = .0282.
Note that the square root of σ²(π) is simply the standard deviation of
examinee universe scores, which is σ(π) = .129 for the synthetic data.

Error variance. Recall from Section 5 that error of measurement
is defined as the difference between an examinee's observed and universe
scores:

     Δ_p = x̄_p - π_p.

If we were to square these differences for all examinees, and then get
the average of these squared differences, we would obtain σ²(Δ). Of
course, π_p is never known exactly, so neither is Δ_p; and, consequently,
σ²(Δ) cannot be obtained directly by averaging the squared values of
Δ_p. However, one can estimate σ²(Δ) using Equation 6.2 in Table 6.3,
and the square root of this value is an estimate of the standard deviation
of examinee errors of measurement. For the data in Table 6.2,
Table 6.3 shows that σ²(Δ) = .0130 and σ(Δ) = .114. It is not immediately
evident from Table 6.3, but σ²(Δ) depends upon the variance
of item difficulty levels, among other things. In general, the smaller
the variance of item difficulty levels, the smaller the value of σ²(Δ).
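
The two variance estimates can be sketched from the summary statistics alone. The function names are hypothetical, and the formulas follow Equations 6.1 and 6.2 as they can be read from the illustrative computations in Table 6.3, using the rounded sample statistics from Table 6.2.

```python
from math import sqrt


def universe_score_variance(k, n, grand_mean, var_persons, var_items):
    """Estimate of the variance of universe scores (Equation 6.1 as read here)."""
    numerator = k * (n * var_persons + var_items - grand_mean * (1 - grand_mean))
    return numerator / ((n - 1) * (k - 1))


def error_variance(n, grand_mean, var_persons):
    """Estimate of the average squared error of measurement (Equation 6.2)."""
    return (grand_mean * (1 - grand_mean) - var_persons) / (n - 1)


# Sample statistics for the synthetic data in Table 6.2.
k, n = 25, 10
grand_mean, var_persons, var_items = 0.824, 0.0282, 0.0058

var_pi = universe_score_variance(k, n, grand_mean, var_persons, var_items)
var_delta = error_variance(n, grand_mean, var_persons)
print(round(var_pi, 4), round(sqrt(var_pi), 3))
print(round(var_delta, 4), round(sqrt(var_delta), 3))
```

With these inputs the estimates come out near .0165 and .0130, with square roots near .129 and .114, in line with the values quoted in the text.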






Agreement coefficient not corrected for chance. The above discussion
of universe score variance and error variance makes no reference
to mastery/non-mastery decisions. When such decisions are to
be made, the advancement score plays a role in the definition of an
agreement coefficient not corrected for chance, although error variance
is still σ²(Δ). This agreement coefficient is defined as:

     Φ(c_o) = [σ²(π) + (μ - c_o)²] / [σ²(π) + (μ - c_o)² + σ²(Δ)],

where c_o is the advancement score in terms of proportion of items
[...] over the universe of items and the [...]. As such, μ has
similarities with x̄, but is not [...]. This definition is rather
difficult to use directly; a simpler formula is provided by Equation 6.3
in Table 6.3. [...] in Φ(c_o), the squared difference [...]. For the
synthetic data, Table 6.3 shows that Φ(.9) = .62.

[...] KR-21 is the value of Φ(c_o) if x̄ actually equals c_o [...] also
[...] discussed later [...] threshold loss agreement coefficients. For
the synthetic data KR-21 = .54, and this is the smallest value that
Equation 6.3 can have for these data, no matter what the advancement
score actually is.
Agreement coefficient corrected for chance. The agreement coefficient
corrected for chance, which is denoted Φ, is easily obtained
using the values of σ²(π) and σ²(Δ) in Equation 6.5 in Table 6.3. For
the synthetic data, Φ = .56, a value very close to KR-21 = .54. Indeed,
Φ and KR-21 almost always have very similar values. This occurs
principally because neither one of them depends on chance agreement, which
is technically (μ - c_o)² for squared error loss.
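
The three squared error loss coefficients can be sketched together to show how they relate. The function names are hypothetical; the formulas are Equations 6.3 through 6.5 as they can be read from the computations in Table 6.3, and the inputs are the rounded values quoted for the synthetic data.

```python
def phi_c(c_o, grand_mean, var_persons, var_delta):
    """Agreement coefficient not corrected for chance (Equation 6.3)."""
    total = var_persons + (grand_mean - c_o) ** 2
    return (total - var_delta) / total


def kr21(var_persons, var_delta):
    """KR-21 expressed with the same quantities (Equation 6.4)."""
    return (var_persons - var_delta) / var_persons


def phi(var_pi, var_delta):
    """Agreement coefficient corrected for chance (Equation 6.5)."""
    return var_pi / (var_pi + var_delta)


grand_mean, var_persons = 0.824, 0.0282
var_pi, var_delta = 0.0165, 0.0130

print(round(phi_c(0.9, grand_mean, var_persons, var_delta), 3))
print(round(kr21(var_persons, var_delta), 3))
print(round(phi(var_pi, var_delta), 3))

# Phi(c_o) is smallest when the advancement score equals the mean,
# where it coincides with KR-21.
grid = [i / 100 for i in range(0, 101)]
smallest = min(phi_c(c, grand_mean, var_persons, var_delta) for c in grid)
print(round(smallest, 3))
```

Because the squared distance between the mean and the advancement score appears in both the numerator and the denominator of Equation 6.3, moving the advancement score away from the mean can only raise Φ(c_o), which is why KR-21 acts as its lower limit for a given data set.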

Interpreting agreement coefficients. Agreement coefficients (and
their reliability counterparts) are discussed and used extensively
in educational measurement, perhaps too extensively! However,
they are frequently difficult to interpret correctly, no matter what
loss function is involved. For this reason, whatever loss function is
involved, the following characteristics of such coefficients should
be kept in mind:

(a) an agreement coefficient generally ranges from 0 to 1, but
a value of, say, .30 is not necessarily "twice as good" as a value of
[...];

(b) when most examinees have observed scores close to the advancement
score, an agreement coefficient not corrected for chance will be
smaller than when most examinees have observed scores relatively far
from the advancement score;

(c) an agreement coefficient will tend to be small whenever universe
score variance is small or error variance is large (even if the
coefficient is based on threshold loss);

(d) an agreement coefficient not corrected for chance reflects
the quality (or consistency) of decisions made about examinees, whereas
an agreement coefficient corrected for chance reflects the contribution
of the test to the quality of such decisions. This is another perspective
on the fact that a coefficient corrected for chance is smaller than its
not-corrected-for-chance counterpart.



Threshold Loss

In the introduction to this section it was stated that a threshold
loss function assumes that all false negative errors are equally serious,
and all false positive errors are equally serious.

To clarify this point let us suppose that the test length is n = 10,
and c_o = π_o = .90. Obviously, an examinee will not be advanced if he/she
gets 0, 1, 2, ..., 8 items correct. [...] it is almost certain that
some of these examinees will be falsely declared non-masters,
because it is likely that some of them [...] at or above [...]. One never
knows which examinees [...] declared to be [...]; it is assumed that any
such false negative error is as serious as any other such error, no
matter what the examinee's universe score actually is; e.g., failing
an examinee with a universe score of π = .91 is as serious an error
as failing an examinee with a universe score of π = 1.00.

Also, the threshold loss function involves assuming that all false
positive errors are equally serious. For the above example, this means
that passing an examinee with a universe score of, say, π = [...] is as
serious an error as passing an examinee with a universe score of, say,
[...].
It should be noted, however, that the threshold loss function
does not involve assuming that false positive errors are as serious as
false negative errors. That issue is a question of loss ratio, a subject
treated in Section 4.

Table 6.4 describes and illustrates the steps required to obtain
the threshold loss coefficients p_o (not corrected for chance) and Kappa
(corrected for chance).

Step 1 simply involves recording results already obtained in Tables
6.2 and 6.3 for the synthetic data.

Step 2 involves computing a z-score based on the advancement score,
c_o. For these data z = .45, which means that the mean, x̄, is 45/100ths
of a standard deviation [s(x̄_p) = .168] below the advancement score.

Step 3 involves determining the proportion of examinees who would
have z-scores below z = .45 if examinee scores were normally distributed.
To obtain this result, Table A.3 in Appendix A is required. For
the synthetic data, this proportion is p_z = .67.

Step 4 involves determining the proportion of examinees who would
have z-scores below z = .45 on each of two (hypothetical) n-item tests
if examinee scores were normally distributed on both tests. For the
synthetic data p_zz = .53. This step makes use of KR-21 [...]
Table 6.4

A Procedure with Illustrative Computations for Estimating Agreement
Coefficients [...] in Table 6.1

Step 1: Record c_o, x̄, s(x̄_p), and KR-21.

        c_o = 9/10 = .9,  x̄ = .824,  s(x̄_p) = .168,  KR-21 = .539

Step 2: Compute the z-score based on the advancement score.

        z = (c_o - x̄) / s(x̄_p) = (.9 - .824) / .168 = .45

Step 3: Using the last column in Table A.3,
        locate the row having the closest
        value to the z-score in Step 2.
        Record p_z, the entry to the left
        of this z-score (under the column
        headed 1.00).

        p_z = .674

Step 4: Find the column in Table A.3 having
        the closest value to KR-21 in Step 1.
        Record p_zz, the entry in this column
        for the row located in Step 3.

        Using the column headed KR-21 = .55,
        p_zz = .533

Step 5: Compute p_o and Kappa.

        p_o = 1 - 2(p_z - p_zz) = 1 - 2(.674 - .533) = .72

        Kappa = (p_zz - p_z²) / (p_z - p_z²)
              = [.533 - (.674)²] / [.674 - (.674)²] = .36

Step 6: Compute the expected proportion of
        inconsistent decisions.

        1 - p_o = .28

U» I IK" ly i.iJiJMH wiuiu) , 

oi.ua lbt:«iU.Ly uUauititKl, w|,l,U Kapi.ni roCltactj, th,* pro,u)rt Pni nr «Krtmiiu«M 
co,u.b.l:uMt.ly .M...,ul.l:l.nl ..ym; .«ikI h.>Yon.l p,-o,)ort io„ l luU: woivLd pfob^bP. 

h,> ^•laM.Ml.r'liKl .)..nul.;U.,.nt.l.y by chaiuu.. I'l'bo L'i«)P'>i- tion probably 
ol..u;,i.f:M.,ui . oncLiituMtly by ch.iric;o L. L ,, (l - ) , l... .54^ 

C.Qf tht; aynrhottc dcvta.j 

Finally, Step 6 in Table 6.4 provides an estimate of the proportion
of examinees who are inconsistently classified, i.e., the proportion
of errors involved in the decision-making process, in the sense of
threshold loss errors. For the synthetic data, this proportion is .28.



The procedure for estimating p_o and Kappa in Table 6.4 is based on
the assumption that examinee universe scores are normally distributed.
In many domain-referenced testing contexts this assumption is probably
not true; but in most cases it is unlikely that violations of this
assumption will cause p_o and Kappa to be poorly estimated.
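
The table look-ups and arithmetic in Steps 3 through 6 can also be sketched in code. The Python below is a minimal sketch, not part of the handbook: it assumes that an entry of Table A.3 is the bivariate standard normal probability named in that table's title, approximates it by Simpson's rule (the function names, the lower integration limit of -8, and the number of subintervals are illustrative choices), and then applies the Step 5 and Step 6 formulas.

```python
from math import erf, exp, pi, sqrt


def normal_cdf(z):
    """Standard normal probability of a value at or below z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def both_below(z, rho, lower=-8.0, steps=2000):
    """P(Z1 <= z and Z2 <= z) for standard normal Z1, Z2 with correlation rho.

    Integrates phi(t) * Phi((z - rho*t) / sqrt(1 - rho^2)) over t <= z
    with Simpson's rule; `steps` must be even and z must exceed `lower`.
    """
    if rho >= 1.0:  # perfectly correlated: reduces to the univariate case
        return normal_cdf(z)
    scale = sqrt(1.0 - rho * rho)

    def integrand(t):
        density = exp(-0.5 * t * t) / sqrt(2.0 * pi)
        return density * normal_cdf((z - rho * t) / scale)

    h = (z - lower) / steps
    total = integrand(lower) + integrand(z)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(lower + k * h)
    return total * h / 3.0


def agreement(z, kr21):
    """Return (p_z, p_zz, p_o, kappa, 1 - p_o) following Steps 3 to 6."""
    p_z = normal_cdf(z)                    # Step 3: column headed 1.00
    p_zz = both_below(z, kr21)             # Step 4: column headed KR-21
    p_o = 1.0 - 2.0 * (p_z - p_zz)         # Step 5
    kappa = (p_zz - p_z ** 2) / (p_z - p_z ** 2)
    return p_z, p_zz, p_o, kappa, 1.0 - p_o  # Step 6 is the last entry


p_z, p_zz, p_o, kappa, inconsistent = agreement(0.45, 0.55)
print(f"p_z={p_z:.3f} p_zz={p_zz:.3f} p_o={p_o:.2f} kappa={kappa:.2f}")
```

With z = .45 and KR-21 = .55 the sketch should give values close to those reported for the synthetic data. Computing the probabilities directly, rather than taking the nearest tabled row and column, avoids the small rounding involved in the table look-up.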

It is important to note that the statistics discussed above refer
to a group of examinees, not to individual examinees. None of these
statistics specify which examinees are consistently or inconsistently
classified.



i\t itmm, i0s4ult5j would alimitit: uartainly ilirtttt'. A tiimiUi titdtu-'' 

■j^iluM , uiu'h Ul r<;tiiL'«HCMt4 vmuuU- btigauaa what. w« av« r^c^ily duluu iti 
rtfcU imfirljMj quV^DtpJlCil^a (p^lltid p*i^f^in*^|:*ipj*) thi^C^*^ ^tfivv^t tii)it4ctrvfci 
ill. ly. 

Recall that a domain-referenced test is viewed as a sample of
items from a larger universe of items constructed to measure the
content under consideration. Also recall that the examinee scores one
would ideally like to know are the examinee universe scores, i.e.,
examinee scores on the universe of items. These ideal scores can
never be obtained; but, in general, longer tests involve less error and
provide better estimates of examinee universe scores.

Therefore, one obvious question is, "How long should a test be?"
There can be no universal statistical answer to this question, because
any specific attempt to answer it eventually involves answering at
least one other question, namely, "How much error is one willing to
tolerate?" Clearly, the answer to this latter question necessitates
subjective judgment by a responsible person who is well aware of all
aspects of the testing environment and the decisions to be made. Even
so, statistics can help in making informed subjective judgments about
test length.



lwi» at i ttt 1 1 ,^ ivtu hd ttci I Illicit «»1 rut 4 hyi.U)t iv a t l oot iir laliglh u \ 
V4t icUU ti 4iun l£» ai aud4IVl tliJVi^t UiMl t:h«> liMmr Wi]M4i tunb 4114, ^ 

i«i|iUr«ii til olJi.4hi lUa pi:Dp>rt loil, Vif uK4mlnui?id iri»u»ii4l4t «ui ly i, l4tjtii r I a*! . 

Hiii.0 ihiii In 'IVtbUi ti.') tit 4( i tir 1 tja f'Di a ic^tit or w ata 

liiulU. i t Itui wltli a |>» iUUi to »llr;it lUHUltilT Muim rr«yit t lu? i ' u » u?nH >l u1 Imj 
^i^at lilt iv.'ii roi Mui ,»v.tU4hlu n it »MU tt^ut , il L.^l im't i on i r j iU«»pi>t$<l 

in Table 6.6, which summarizes results for test lengths of n' = 10, 15,
and 20. (The first row of Table 6.6 simply duplicates results already
reported in Tables 6.3 and 6.4 for the 10-item test.) From Table
6.6 it is clear that, as test length increases, both σ(Δ) and 1 - p_o
decrease, but not very rapidly. In interpreting σ(Δ) it is useful to

koop in mind thai It: iuxt\ bo no lat'qor t\uu\ O.^IS wtuui iuvch obMorv^nl 

item score takes on one of two possible values, as is the case for
the synthetic data in Table 6.2.

The values of σ(Δ) and 1 - p_o reported in Table 6.6 are based upon
synthetic data, but similar results can easily occur with real data.
Furthermore, the values of σ(Δ) and 1 - p_o reported in Table 6.6 would
probably be judged rather large in most real contexts. Of course,
these values can be reduced by increasing test length beyond 20 items.
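
How σ(Δ) and KR-21 change with test length can be sketched with two standard projections. The Python below is an illustration under stated assumptions, not a transcription of the handbook's own formulas: it assumes that the error variance of a mean score is inversely proportional to the number of items, so that σ(Δ) scales by the square root of n/n', and that KR-21 is projected with the Spearman-Brown formula. The function names are hypothetical. Projecting 1 - p_o additionally requires the z-score for the lengthened test and is not attempted here.

```python
from math import sqrt


def projected_sigma_delta(sigma_delta, n, n_new):
    """Project sigma(Delta) from an n-item test to an n_new-item test,
    assuming error variance is inversely proportional to test length."""
    return sigma_delta * sqrt(n / n_new)


def projected_kr21(kr21, n, n_new):
    """Project KR-21 to a new test length with the Spearman-Brown formula
    (an assumption here; k is the ratio of new to available length)."""
    k = n_new / n
    return k * kr21 / (1.0 + (k - 1.0) * kr21)


# Start from sigma(Delta) = .11 for the available 10-item test.
for n_new in (10, 15, 20):
    print(n_new, round(projected_sigma_delta(0.11, 10, n_new), 2))
```

Starting from σ(Δ) = .11 for the 10-item test, the first function gives approximately .09 and .08 for 15 and 20 items, which agrees with the σ(Δ) values reported in Table 6.6.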
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n-' (A ) 




1(» > ( Jii 10) ri) 



vol 
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t ill .1 'i'tnit of l.«^iii|th M ( t . *i , ^ I* ) * 

i4iti<i lor av.tiWilWo tMut 

(10 » i' - X ' n 



(•') 
Note: All statistics for a test of length n' are identified with a prime to distinguish them from
the corresponding statistics for the available n-item test.
Table 6.6
Illustrative Results for Changes
in Test Length Using the
Synthetic Data Example

  n'     σ(Δ)     KR-21     1 - p_o

  10     .11      .54       .28
  15     .09      .64       .26
  20     .08      .70       .25
In beginning the above discussion of test length, it was pointed
out that data, per se, cannot specify what the test length should be,
but data can help in making an informed, but still subjective, judgment
about test length. In this regard, σ(Δ) and 1 - p_o are helpful; but
it must be recognized that these two statistics provide different types
of information, and perhaps not equally useful information in a
particular context. In the extreme, if an investigator were interested only
in minimizing classification errors, then σ(Δ) would provide irrelevant
information; and, conversely, if an investigator were interested only
in measurement error, then 1 - p_o would provide irrelevant information.
The perspective taken above is that, in most realistic settings,
both types of error are likely to be of interest; and, therefore,
consideration has been given to both. Only in a specific context can a
judgment be made concerning which statistic is more appropriate in
considerations regarding test length. As discussed below, a similar
argument applies to agreement coefficients.
Other Considerations

Throughout this section, squared error loss and threshold loss
statistics have been treated in parallel. If, in a given context, an
investigator has an unambiguous basis for choosing one loss function
over the other, then, of course, statistics involving the other loss
function become irrelevant. However, in many situations, choice of
a loss function may not be a completely unambiguous decision and,
indeed, it may be that neither loss function is ideal. In such
situations, one approach is to examine statistics for both loss functions,
keeping in mind the different assumptions involved. In doing so, there
is some potential for confusion, but a theoretically better approach
would involve complexities far beyond the intended scope of this
handbook.

In this regard, it should be kept in mind that it is not always the
case that a test is used to make a single type of decision. For example,
it may be that a given test is sometimes used to make mastery/
non-mastery types of decisions assuming threshold loss; and, at other
times, the test is used simply to estimate examinee universe scores
assuming squared error loss. For such a test, both loss functions are
appropriate depending upon the use of the test. Indeed, in choosing
a loss function, the question of importance is not what constitutes
the test, but rather what constitutes the assumptions about the
decisions to be made using the test.

Sometimes a domain-referenced test is used solely for the purpose
of estimating examinee universe scores, without any consideration of
a cutting score. In such situations (assuming that squared error loss
is relevant), σ(Δ) is still appropriate, as is the index Φ given by
Equation 6.5 in Table 6.3. In this sense, Φ may be viewed as a
general-purpose agreement coefficient, or index of dependability, for a
domain-referenced test. Note that when a domain-referenced test is used solely
to estimate examinee universe scores, threshold loss statistics like
those treated above are meaningless.

In the introduction to this section, reference was made to the
fact that the agreement coefficients discussed above are sometimes
called reliability coefficients. Actually, these agreement coefficients
carry with them a connotation of validity, too, in the sense that they
involve consideration of the universe of items, which is often the
principal "criterion" of interest, or the only criterion available.
Indeed, one perspective on measurement suggests that notions of
reliability and validity can be blended together into a consideration of the
extent to which observed scores are generalizable to universe scores.
This perspective seems especially relevant for domain-referenced
interpretations of test scores. In this sense, this section has considered
issues relevant to both reliability and validity.



Appendix A 

Table A.1 is based on the Fhanér-Wilcox-Huynh procedure referenced
in Appendix B. This table was developed using the IMSL (1979)
subroutine MDBETA.

The results reported in Table A.2 are based on the assumptions of
binomial likelihood and a uniform beta prior (see Appendix B). The
probabilities reported in Table A.2 were obtained using the IMSL (1979)
subroutine MDBETA; and the credibility intervals were obtained using
CADA (Isaacs and Novick, 1978), Novick and Jackson (1974), and some calculus.

Table A.3 was developed using the IMSL (1979) subroutines MDBNOR
and MDNOR.
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. ORtiEREli rtCCORUlNO TO INTERUAl HUi-pOINTO 



HID POINl •• OiAS 



KIK 1,088 MMm OF It 



1(1 

n (O.OMO) 



0 . m 

0 . 700 
(OilOO) 



III 

(O.OSO) 



' 10 ' TO-" 
0,/fiO 0./7b . 



O.AOO 
' TO 
0.1)00 



TO 

0,, A/ri 



(0. 100) 



(0. ISO) 



<0',200i ' (O.pfiO) 



6 

. 7 

u 

V 

10 

11 

12 
13 
14 
15 

\6 . 
17 

ra 

19 

20 

21 

22 

2'3 

24 

25 

24 '• 

27 

29 

29 

3^;r 



5/ 
6/ 
6/ 
7/ 
0/ 
0/ 



4/ 

5/ 
A/ 
6/ 
7/ 
0/ 
8/ 9/10 
9/10/ip 
.ylO/ i,0/l"l 
lO/U/12 
11/12/12 
12/12/.13 
12/13414 
r3/14/M 
,14/14/15- 
^/15/16 
iVlA/lvV 

)/i:6yi7.<' 
ra/ 19/30; 

19720/.2X) 

iV/soi;{2i„ 

20/|!l/22 . 



4/ 

5/ 
6/ 
6/ 
7/ 
tl/- 
B/ 



7 
7 
8 
9 
9 



f,/ 

6/ 
7/ 
0/ 
tl/ 
9/ 

9/10/10 

10/10/n 

10/11/11 

11/12/17 

12/12/13 
,12/13/13 
U( 14/14 

HyH/Ts 

r4'/157l5' 

KA/i'i/n ■ 

1 6/^1 1/ 17 

18->i9;i'l 
,19/20/20- 
19/20/21 



S/ 
5/ 
6/ 
7/ 
8/ 
0/ 



.0 

7 
8 
9 
9 





RAtIO OF' 2 



t»/ 
6/ 
7/ 
0/ 
8/ 
9/ 

9/10/10 
10/10/1 1 
10/11/,12 
11/12/12 
12/1.3/13 
Ji2/13,,'14 
13/14/14 
14/15/15 
,15/15/16 
15/16V17 

iA/r;/i7 

17/18/m. 

/18/19 
18/.r9./20 
19/720/20 
lv9/20/2'l' 
20/21<,!!22 ■ 
|l/2^/22 
• 22/ 2;^/ 2:1 



S/. 

Y>/ 

A/ 

7/ 

0/ 



!■./ A 
A/ A 
7/ 7 
7/ {)■ 
«/• 9 
8/- 9/' 9; 
9/10/10 
lO/lO/lT 
.10/1-J/lV 
11 /1 2/ 1,2 
12/12/13^ 
12/13/14' 
13/14A14 
14/15/15' 
15/15/lA 
15/1 A/IA 
lA/17/17' 
17/17/18 
,l7/m,/19 
10/ 19 V. 1^9 
' 19/i0/2i) 
•20^20/21 
-al/21/2l 
21/22/22 
^27£'^23 



.S/, 

;5/ 

6/ 
7/ 
0/ 
.8/ 



A 
,A 
') 
0 
8 
9 



5/ 

4/ 
II 
7/ 
8/ 
9/ 

9/l'0/10 
10/10/11 : 
10>/11/11 

,^'t/12/12 ' 
'12/12/13 
13^/13/13 " 
13/14/14 ' 
14/15/J.9L 

. 15/18/lAV 

\i/\im 

17/ 1.7V 1 8 
.'17/18/18 
18/19:><iy 
i9/-19/20,^ 
26^20/3'l 
20 ^5^/2 r 
' 21722/22- 
2 2'/ 2 2/23 



S/ 
, 5/' 
■ A/ 

7/,; 

• 8/- 



7 
■B 

!-8 



5/ 
A/ 
7/ 
7/ 
8/ 
9/ 

, 9/10/1(1 
. iO/lO/ll" 
^ Jl 0/1 1/11 
11>12/13 
/l 2/ 12713 
13/13/13 

- 13-/14/14 
!i/M/14/15( 

mi^'/^lA 
iil5/^A/lA • 

- !ii'/i7/v; 
'*.i7'/lV;8^ 

i;a/i8/ia 

■ 18/19/19 
19/19/20 
• 20/20/'^0 
''.20/21/21 
21/22/^2.2 
. 22/M723 



5/ A/ A 
.A/ A/ 7 
' II 7/ ,0 
7/8/ 8 
0/ 9/ 9 
9/10/10 
10J'107ll( 
10/11/1.1- 

riyi2/i2 

'12/13yi3' 
1^13/14 
irVl4/l5 
M/15/^5 
15/16/l<A 

.lA/l/W-V? 

.16/ii7>lB 
174^^8 

ia/ijrAf9" 



0,700 
TO 
0 . llOO 

<0,100) 



A/ A 
A/ A/ 7 
7/ 7/ / 
■II 0/ 0 
0/ 9/ 9. 
9/ 9/10 
'lO/lQ/U 
10/11/11 
NOTE. The width of each interval is indicated in parentheses below the limits of the interval.
Entries in the table are advancement scores for loss ratios of 1, 2, and 3, respectively. For example, a loss
ratio of 2 is appropriate if false positive errors are twice as serious as false negative errors.
Table A.2 (Continued)

Inferences about Universe Score Given
n and x for an Examinee

Probability that π is at or above

Credibility Intervals
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( .37, 
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10 
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12 
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( .56, 


.89) 


16 


13 
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95 


90 


80 


65 


45 


24 
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.89) 
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.91) 
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16 


14 
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99 
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75 
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16 
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98 


94 


83 
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10 
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( .48, 
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( .44, 
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( .40, 
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14 
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( .54, 
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( .50, 
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( .60, 


.80) 
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Inferences about Universe 'Score Given 
n and \x for an Examinee 
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Credibility Intervals 
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( .81, 
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. .97) 
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63 


39 
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2 
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( .73, 
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( .69, 


.94) 


20 


18 
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99 


97 


93 
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63 
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8 


( .82, 


.95) 
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.98) 


20 


19 
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99 


98 


94 


84 


64 
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( .86, 


.99) 
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i Table A. 2 (Continued) 

Inferences about Universe Score Given 
n and x for an Examinee'^' 
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21 
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0 


21 


14 


0.67 


71 
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16 


6 
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15 


0.71 


84 


70 
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30 
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4 


0 


0 
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93 
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2 
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17 
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18 
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98 
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84 
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19 
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99 
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94 


85 
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9 
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86 
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99 
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18 
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35 
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4 4 
61 
77 
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9 
1 9 
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( .39, 
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( .44, 
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( .53, 
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( .69, 
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1.00) 
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( .41, 
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( .50, 
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( .60, 
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.99) 


( .87, 


.95, 
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( .48, 
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( .52, 
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( .57, 
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( .62, 
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♦Cf^84, 1.00) 
( . 90, 1.00) 



( .36, 

( .40, 

( .44, 

( '.49, 

( .53, 

( .58, 

( .63, 

( .68, 

( .73, 

(■ .79; 



.68) 

.-72). 
.16)' 

.83) 
^ . 87) 
.90) 
.93) 
.9'5), 
'.98) 



00 

CD 



( .85, i.OO) 



Table A. 2. (Continued) 



n 


X 


X. 




.70 


.75 


' . 80 


.85 
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24 


14 
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24 
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0.71 
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81 
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3 
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24 


20 


0.83 
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97 


91 


79 
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32 


10 


1 
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( .69, 
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24 
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24 
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.94) 


( .74, 
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97 
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46 
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( .90, 
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1 
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0 


0 
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14 
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0 


0 


0 
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1 


0 
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35 
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97 


92 


79 
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Table A. 2 (Continued) 



n 



Inferences about Universe Score Given 
n and x for an Examinee 



Probability that tt is at or dbove 
X .60 .65 .70 .75 .80 .85 .90 .95. ^ 



27 


^ 

14 


o»tr2_ 


19 


7 


9 


0 


0 


0 


0 ' 


0 


27 


1 vJ 


A «=*. A 


"< A 


1 4 


5 


1 


0 


0 


0 


0 


2 7 


1 6 


U ♦ 7 


A S 
*f >J 


^) c 


1 0 


3 


0 


0 


0 


0 


■ ) "7 
/ 


1 "7 
1 / 




A 0 

O V/ 


\J o 


1 9 


7 ' 


1 


0 


0 


0 


2 7' 


1 8 




7 *1 




v> 


1 4 


r 4 


1 


0 


0 


27 


1 7 


A "7 A 


t> J 


A <9 

O 7 


4 7 




•V 9 


2 


0 


^ 0 


2 { 


^ \) 


A 7 4 


O "I 

7 O 


<_) 


6 4 


40 


18 


5 


0 


0 


2 7 


2 1 


/*i "7 Q 


Q 7 


Q 1 

, 7 i _ 


7 J { 


57 


32 


1 2 


2 


: 0 


*u / 




A O 1 
O ♦ O 1 


O O 

T 7 


Q A 


ft 9 

- Cl 7 


7 4 


50 


24 


6 


0 


2 7 


2 »5 






/ T 


Oil- 


86 


69 


4 1 


14 


1 


27 


24 


A a o 


1 A A 


1 Afi 
J. V V 


9 f{ 


9 4 


8 4 


62 


31 


5 


2.7 


2 o 


/"k o 
U • T >1 


1 A A 


1 k^A 


1 V/ V7 


9 8 


9 4 


81 


54 


16 


2 7 


26 


u • y o 


1 A A 




1 OA 


100 


98 


94 


78 


4 1 


2.7 


2 / 


i A A 


1 A A 
1 U U 


1 nA 


10 0 


100 


100 


9 9 


95 


76 


p 
o 


1 A 




1 4 


5 


1 


0 


0 


0 


0 


0 


2M 


U3 


0*54" 


23 


10 


3 


1 


0 


0 


0 


.0 


28 


16 


0.57 


36 


18 


"7 
■ y 




0 


0 


0 


■ - 0 


28 


17 


0.61 


^ 51 


30 


13 


4 


1 


0 


0 


<) 


2B 


18 


0.64 


66 


44 




9 


2 


0 


0 


•0 


28 


19 


0.68 


79 


59 


36 


17 


5 


1 


0 


0 


28 


20 


. 7 1 


88 


74 


' 5 2 


29 


1 1 




0 


0 


28 


2 1 


0.75 


94 


85 


68 


4 4 


2 I 


6 


, 1 


0 


28 


2 2 


0.79 


98 


93 


81 


61 


36 


13 




0 


28 


2;5 


0.^2 


99 


97 


VI 


7 7 


5^ 


26 


6 


0 


28 


24 


0.86 


100 


9V 


96 


88 


7 2 


44 


16 


1 


28 


25 


0.89 


100 


100 


V ■? 


'95 


B6 


65 


33 


5 


28 


26 


0. 93 


100 


100 


100 


99 


95 


8 3 


57 


IB 


28 


27 


0. 96 


100 


100 


1 0 0 


i 0 




9^> 


iJO 


4 


28 


28 


1*00 


100 


' 100 


100 


100 


100 


9 9 


95 


77 




1 

^ \^ 



Credibility Intervals 



67 Percent 



80 Percent 



90 Percent 





. b 1 ; 




.On; 


1 1 1 


fkl \ 


.46, 


.64) 


( .44, 


.67) 


( .40,- 


.70) 


. 50, 


.68) 


( .47, 


.71) 


( .44, 


.73) 


. 54 , 


.71) 


( .51, 


.74) 


( .48,. 


.77) 


.58, 


.75) 


( .55, 


.77)' 


( .51, 


.80) 


.62, 


\ .78) 


( .59, 


.80) 


( .55, 


.83) 


.66, . 


.82) 


( .63, 


.84) 


( .59, 


.86) 


.70, 


.85) 


( .67, 


.87) 


( . 63 , 


.89) 


.74, 


.88) 


( .71, 


.90) 


( .68, 


.91) 


.78, 


.91) 


( .75, 


,.92)' 


( .72, 


.94) 


.82, 


.94) 


( . 80 , 


.95) 


( .77, 


.96) 


.87, 


.97) 


( 184, 


.97) 


( .81, 


.98) 


.91, 


.99) 


( .89, 


.99) - 


( .87, 


1.00) 


.96, 


1.00) 


( .94, 


1.00) 


( .92, 


r.oo) 


.41, 


.59) 


( .38, 


.62) 


( .35, 


.65) 


.45, 


.62) 


( .42, 


.65) 


( .39, 


.68) 


.48, 


.66) 


( .45,' 


-.68) 


( .42, 


.71) 


.52, 


.69) 


( .49, 


.72) 


( .46, 


.75) 


.55, 


.73) 


( .53, 


.75) 


( .49, 


.78) 


.56, 


.76) 


( .56, 


.78) 


( .53, 


.81) 


.63, 


.79) 


( .60, 


.81) 


( .57, 


.84) 


.67, 


.82) 


( .64, 


.81) 


( .61, 


.86) 


.71, 


.85) 


( .68, 


.87) 


( .65, 


.89) 


.75, 


.88) 


( .72, 


.90) 


( .69, 


.92) 


.76, 


.91) 


( .76, 


.93) 


( .73, 


.94) 


. 83, 


.94) 


( .80, 


.95) 


( ,.77, 


.96) 


.87, 


.97) 


( .85, 


.98) 


( .82, 


.98) 


.92 , 


.99) 


( .90, 


.99) 


( .87, 


1.00) 


.96, 


1.00) 


( .95, 


1.00) 


( .92, 


1.00) 



O 



1 < 7 



Table A,2 (Continued) 
Table A.3
Probability that Two Standard Normal
Variables, with Correlation Equal to KR-21,
are Both Less Than or Equal to z

                                                          KR-21
   z     0.20  0.25  0.30  0.35  0.40  0.45  0.50  0.55  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00     z

 -1.95  0.002 0.002 0.002 0.003 0.003 0.004 0.005 0.006 0.006 0.007 0.009 0.010 0.011 0.013 0.015 0.018 0.026  -1.95
 -1.90  0.002 0.002 0.003 0.004 0.004 0.005 0.006 0.006 0.007 0.009 0.010 0.011 0.013 0.015 0.017 0.021 0.029  -1.90
 -1.85  0.002 0.003 0.004 0.004 0.005 0.006 0.007 0.008 0.009 0.010 0.011 0.013 0.015 0.017 0.020 0.023 0.032  -1.85
 -1.80  0.003 0.004 0.004 0.005 0.006 0.007 0.008 0.009 0.010 0.011 0.013 0.015 0.017 0.019 0.022 0.026 0.036  -1.80
 -1.75  0.004 0.004 0.005 0.006 0.007 0.008 0.009 0.010 0.012 0.013 0.015 0.017 0.019 0.022 0.025 0.029 0.040  -1.75
 -1.70  0.004 0.005 0.006 0.007 0.008 0.009 0.010 0.012 0.013 0.015 0.017 0.019 0.022 0.025 0.028 0.033 0.045  -1.70
 -1.65  0.005 0.006 0.007 0.008 0.009 0.011 0.012 0.014 0.015 0.017 0.019 0.022 0.024 0.028 0.031 0.037 0.049  -1.65
 -1.60  0.006 0.007 0.008 0.009 0.011 0.012 0.014 0.016 0.018 0.020 0.022 0.025 0.028 0.031 0.035 0.041 0.055  -1.60
 -1.55  0.007 0.008 0.010 0.011 0.013 0.014 0.016 0.018 0.020 0.022 0.025 0.028 0.031 0.035 0.039 0.046 0.061  -1.55
 -1.50  0.009 0.010 0.011 0.013 0.015 0.016 0.018 0.020 0.023 0.025 0.028 0.031 0.035 0.039 0.044 0.051 0.067  -1.50

 -1.30  0.016 0.018 0.021 0.023 0.025 0.028 0.031 0.034 0.037 0.041 0.045 0.049 0.054 0.060 0.066 0.075 0.097  -1.30
 -1.25  0.019 0.021 0.024 0.026 0.029 0.032 0.035 0.038 0.042 0.046 0.050 0.055 0.060 0.066 0.073 0.083 0.106  -1.25
 -1.20  0.022 0.024 0.027 0.030 0.033 0.036 0.040 0.043 0.047 0.051 0.056 0.061 0.066 0.073 0.081 0.091 0.115  -1.20

 -0.85  0.056 0.060 0.065 0.070 0.075 0.080 0.086 0.092 0.098 0.104 0.111 0.119 0.127 0.137 0.148 0.163 0.198  -0.85
 -0.80  0.063 0.068 0.073 0.078 0.083 0.089 0.095 0.101 0.107 0.114 0.122 0.130 0.138 0.148 0.160 0.175 0.212  -0.80

 -0.50  0.121 0.127 0.134 0.141 0.148 0.156 0.163 0.171 0.180 0.188 0.198 0.208 0.219 0.231 0.245 0.264 0.309  -0.50



Appendix B
Technical Notes

These notes are provided for two reasons: (a) to cite appropriate
technical background and references for each section of the handbook;
and (b) to provide a limited amount of technical justification for
equations and/or procedures that are not specifically reported in readily
available references. However, there is no intent to cite all potentially
relevant references or to verify in detail all equations and/or
procedures.

In the body of this handbook, distinctions have been drawn only
very rarely between parameters and estimates of parameters. In these
technical notes such distinctions are made through the use of a "hat"
(^) above unbiased estimates of parameters, which are denoted by Greek
letters. The reader should be careful not to confuse this use of a
"hat" with the use already made of this symbol in the body of the
handbook. Specifically, the "hat" symbol is also used to distinguish
between the sample variances s^2 and \hat{s}^2, where the former involves a
denominator of n and the latter involves a denominator of n - 1. (Of course,
\hat{s}^2 is an unbiased estimate of a parameter, but usually not a parameter
of interest here.)
Section 1

Berk (1980) provides an edited book of readings on the subject of
domain-referenced (or criterion-referenced) measurements. Most of the



w»:Lut:uii \)i Uuil^iuxily l ot i>t£uH, i I l,oli«t ti > ^mi Nll:ko ( loao) i:j^viawti t.h«» many 

that there are clear differences between this handbook and the above
references: differences in emphasis and scope, as well as occasional
differences in perspective and approach.

Many introductory measurement textbooks give considerable attention
to defining objectives and specifications. Recently, Ellis
and Wulfeck (1979) and Ellis, Wulfeck, and Fredericks (1979) have
developed a task/content matrix for specific use in Navy training that
involves domain-referenced testing.

' " ' *■ 



Section 2

Most introductory measurement textbooks provide detailed discussion
of item analysis procedures. Even though such discussions usually
emphasize norm-referenced testing, many of the guidelines typically suggested
are relevant for domain-referenced testing, too, with one noticeable
exception. In the opinion of this author, it is not generally a good
practice in domain-referenced testing to select items in a systematic
manner so as to obtain some pre-specified distribution of item difficulty
levels and/or discrimination indices. More specifically, this is not a
good practice if a test is to be used solely for the purpose of making
domain-referenced interpretations of test scores.
on this index.



Section 3

The procedure suggested in Section 3 for establishing a cutting score
is a slight modification of a procedure originally proposed by Angoff
(1971); and the developments involving σ(ȳ) are discussed by Brennan
and Lockwood (1980). The specific equations for σ(ȳ) in Table 3.2 can
be derived in the manner outlined below.

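Let the probability assigned by rater r (r = 1, 2, ..., t) to item i
(i = 1, 2, ..., m) for a set of m items be:

    Y_{ri} = \Lambda + \tilde{\Lambda}_r + \tilde{\Lambda}_i + \tilde{\Lambda}_{ri}

where \Lambda is the grand mean and the \tilde{\Lambda} terms are score effects as discussed by
Brennan and Lockwood (1980). It can be shown that unbiased estimates
of the variance of these score effects, in terms of the sample
statistics reported in Table 3.2, are:

    \hat{\sigma}^2(ri) = [\sum_r \hat{s}^2(y_{ri}) - t \hat{s}^2(\bar{y}_i)] / (t - 1)      (B1)

    \hat{\sigma}^2(r) = \hat{s}^2(\bar{y}_r) - \hat{\sigma}^2(ri)/m                          (B2)

    \hat{\sigma}^2(i) = \hat{s}^2(\bar{y}_i) - \hat{\sigma}^2(ri)/t                          (B3)

For a test of n items, the estimated variance of the mean is:

    \hat{\sigma}^2(\bar{y}) = \hat{\sigma}^2(r)/t + \hat{\sigma}^2(i)/n + \hat{\sigma}^2(ri)/(nt)      (B4)

Using Equations B1 to B3 in B4 we obtain

    \hat{\sigma}^2(\bar{y}) = \hat{s}^2(\bar{y}_r)/t + \hat{s}^2(\bar{y}_i)/n
                              - [ (\sum_r \hat{s}^2(y_{ri}) - t \hat{s}^2(\bar{y}_i)) / (mt(t - 1)) ]      (B5)

where the bracketed term in Equation B5 is \hat{\sigma}^2(ri)/tm, which constitutes
the A-term defined in Table 3.2. The square root of Equation B5 is
Equation 3.2 in Table 3.2; and when n equals m, the square root of
Equation B5 is Equation 3.1 in Table 3.2.

Finally, as n approaches infinity, it is evident from Equation B4 that

    \hat{\sigma}^2(\bar{y}) = \hat{\sigma}^2(r)/t

and using Equation B2,

    \hat{\sigma}^2(\bar{y}) = \hat{s}^2(\bar{y}_r)/t - \hat{\sigma}^2(ri)/mt
                            = \hat{s}^2(\bar{y}_r)/t - A                                      (B6)

The square root of \hat{\sigma}^2(\bar{y}) in Equation B6 is Equation 3.3 in Table 3.2.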
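
Equations B1 through B6 can be traced numerically. The Python below is a minimal sketch under stated assumptions: the sample variances are taken with denominators of m - 1 or t - 1 (as `statistics.variance` computes them), the function name and the small ratings matrix are hypothetical, and negative variance estimates receive no special treatment beyond guarding the final square root.

```python
from math import inf, sqrt
from statistics import mean, variance  # variance() divides by (count - 1)


def cutting_score_error(ratings, n=None):
    """Estimate variance components and sigma(y-bar) from rater-by-item data.

    ratings[r][i] is the probability assigned by rater r to item i
    (t raters, m items); n is the number of items in the intended test
    (defaults to m; pass math.inf for the limiting case of Equation B6).
    """
    t, m = len(ratings), len(ratings[0])
    n = m if n is None else n
    var_rater_means = variance([mean(row) for row in ratings])
    var_item_means = variance([mean(col) for col in zip(*ratings)])
    within_rater = sum(variance(row) for row in ratings)

    var_ri = (within_rater - t * var_item_means) / (t - 1)  # Equation B1
    var_r = var_rater_means - var_ri / m                    # Equation B2
    var_i = var_item_means - var_ri / t                     # Equation B3
    var_ybar = var_r / t + var_i / n + var_ri / (n * t)     # Equation B4
    return {
        "var_ri": var_ri,
        "var_r": var_r,
        "var_i": var_i,
        "sigma_ybar": sqrt(max(var_ybar, 0.0)),
    }


# Hypothetical ratings: 2 raters by 3 items, purely additive (no interaction).
ratings = [[0.5, 0.6, 0.7],
           [0.6, 0.7, 0.8]]
print(cutting_score_error(ratings))         # n = m = 3
print(cutting_score_error(ratings, n=inf))  # infinitely many items
```

Because the hypothetical ratings are purely additive, the interaction component is zero and the limiting value of σ(ȳ) depends only on the variability among rater means.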






by t'lma^di rtiui iia^toil by WiUxm (iuVn). it whouUl Li** iKito»1| 

howi>viir, tU^v^ whinr** Huynh trtlkw Mhaut Mm loum r^tio thli aut;hor 

an If4l«tt ntnj<itiv« tttioi-tt, Huynh wi^yti th« lot*« t^tlu im Q *10^ *mtt In 
iiection 4 t:hU iofciii t:tktio id lituntirittU aaS I/. SO Of iJouri*«, thU 

dirfMr«nu© ia almpLy a ciuontlon at t,1«f initlon. 

It is suggested in Section 4 that a confidence interval for π_o
from a cutting score study be considered one possible way to define
an indifference zone. In doing so, it might be argued that one is
implicitly violating the assumption of 0-1 referral loss, which is
an assumption made by Huynh (1980) in his formulation of the minimax
procedure used to generate Table A.1. Another approach that might
be considered is to eliminate the indifference zone and use ȳ and
σ(ȳ) from a cutting score study to establish an ogive-shaped referral
success function, but this is considerably more complicated than the
approach taken in this handbook.

Section 5

With respect to technical issues, Section 5 is based principally on
Table A.2, which was developed under the assumptions of a binomial
likelihood and a uniform beta prior distribution for π (sometimes called
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i, 



\ 



rj^vo.u r Luqilsly , i*a oiU iy la ih« l^fi Ii«in4 i^a^ i 'r«si(a«j h -.J i», 
(M ^ nhw H ) - i - I TH « I, It H ^ I) i 

TaUI^ Av ^* ^4 n<*y»iii4n i!t«i*ti|^»U U y liu«*ivai ri)iL ir uut1«ii thu 4iai«ti4iti|»t too 

i.iMfc>a hiijhwttt il^nwlty r«*ijh»oifi, Cumuw ttiWjhi t|\iaiti«L wltli ^;alUu.| an 
Intdival fi l\iL|h^Ht tl^tmlty tdijlun wlii»n ti K.) U»mt0t»i unfamiliar with 
I htiHf* HAy«tiiai\ or>nt-*«i>ti« cau uDtiiiult Novlck ^iiul J^uknou ( ]/i /4, Cimpltti *k ) 
A principal reason for using a beta prior here is that this assumption 
results in a Bayesian credibility interval, which enables one to 
make probability statements about the parameter $\pi$. By contrast, a 
confidence interval allows one to make probability statements about 
intervals covering $\pi$. Some might argue that in specific contexts, a 
uniform beta prior is frequently unrealistic because a decision-maker 
may know a great deal about an examinee. However, to assess "informative" 
(i.e., non-uniform) beta priors in a decision-making process virtually 
necessitates an interactive computing system such as CADA (Isaacs and 
Novick, 1978). Furthermore, a decision-maker would need to justify the 
specific "informative" prior chosen in each and every individual case. 
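
The distinction drawn above can be illustrated with a minimal sketch. It assumes an examinee who answers x of n items correctly, a binomial likelihood, and a uniform beta prior, so that the posterior for $\pi$ is Beta(x + 1, n - x + 1). The function names are hypothetical, and the interval computed is an equal-tailed one chosen for simplicity; it is not claimed to reproduce the entries of Table A.2, which may be defined differently.

```python
from math import comb


def posterior_cdf(p, x, n):
    """P(pi <= p | x of n items correct) under a uniform Beta(1, 1) prior.

    The posterior is Beta(x + 1, n - x + 1). For integer parameters its CDF
    equals the probability that a Binomial(n + 1, p) count is at least x + 1.
    """
    m = n + 1
    return sum(comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(x + 1, m + 1))


def posterior_quantile(q, x, n, tol=1e-10):
    """Invert posterior_cdf by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if posterior_cdf(mid, x, n) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def credibility_interval(x, n, level=0.80):
    """Equal-tailed credibility interval (an assumption of this sketch)."""
    tail = (1 - level) / 2
    return posterior_quantile(tail, x, n), posterior_quantile(1 - tail, x, n)


def prob_at_or_above(pi_o, x, n):
    """Posterior probability that the examinee's pi is at least pi_o."""
    return 1 - posterior_cdf(pi_o, x, n)


if __name__ == "__main__":
    low, high = credibility_interval(x=8, n=10, level=0.80)
    print(f"80% interval for pi: ({low:.3f}, {high:.3f})")
    print(f"P(pi >= 0.70 | data) = {prob_at_or_above(0.70, 8, 10):.3f}")
```

The last function is the kind of direct probability statement about $\pi$ that a credibility interval supports and a confidence interval does not.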






iir a)f4mii(«icsci i^aaiiia Ulghlv uiu «4ll«l tki tti t h I «» 41U h»»K « ilho uU^jhl 4i *iii(i 

I ri4t 40 tUfiUlM**! i Vo |>«< 4 piloi tijviht h«» ucitSit, luii , 44 ltli1U4l«i»t ^ V » 

l\\ti p^4»ir«j»4 or 'lol llU 40 l4 r4» rii)lU ll lVl4t 41I1I t i4<ii ly lirtyoiUt t h« tfi U^ici 

t »r t h ii!4 li4ntUuH >K, . ) 

The theoretical framework used in Section 6 for integrating squared 
error loss and threshold loss approaches is provided by Kane and Brennan 
(1980). In addition, a considerable number of papers have been published 
that involve consideration of one loss function or the other. 
Concerning threshold loss, the following publications, among others, 
are relevant: (a) Hambleton and Novick (1973) provided the first 
integrated treatment of threshold loss and domain-referenced testing issues; 
(b) Swaminathan, Hambleton, and Algina (1974) suggested using coefficient 
kappa; (c) Huynh (1976) and Subkoviak (1976) provided procedures for 
estimating threshold loss coefficients based on a single test; and (d) 
Subkoviak (1980) has reviewed much of the work in this area. 

Concerning squared error loss, the following publications, among 
others, are relevant: (a) using classical test theory assumptions, 






niafoi.'^^ tilt S3 »ac4ciiiM, the* klo $ t v 5* t I t »na kif t hc»»«3-^aii (a c^d a 1 ^fo 

( t-l , -4 . ' * , li) titli 

itftmfij vt '\* in t ho ?»u:<)r« f»ff<n:t for f)fUM()n p (it u ^ " v,) ^ t h** 

l> l> i 

v:<. J t <rj f f u t f ( ) t 1l t erti 1 ; n < I rt B i ^ t ho <^ f f a c t f ( > r t ) ui i n t b t a c 1 1 on 

pi 

of person p and item i, which is confounded with experimental error. 
(See Brennan and Kane, 1977a, for more detail.) 

It is well known that an unbiased estimate of $\sigma^2(\pi)$ is: 

$$\hat{\sigma}^2(\pi) = [MS(p) - MS(pi)]/n \qquad \text{(B7)}$$



where "MS" is "mean square"; and, it is relatively easy to show that, 
for dichotomous data, Equation B7 can be expressed as Equation 6.1 in 
Table 6.3. In a similar manner, it can be shown that 



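$$\hat{\sigma}^2(\beta) = \frac{n\,[k\,s^2(\bar{x}_i) + s^2(\bar{x}_p) - \bar{x}(1-\bar{x})]}{(n-1)(k-1)} \qquad \text{(B8)}$$

and 

$$\hat{\sigma}^2(\pi\beta) = \frac{n\,k\,[\bar{x}(1-\bar{x}) - s^2(\bar{x}_p) - s^2(\bar{x}_i)]}{(n-1)(k-1)} \qquad \text{(B9)}$$

Now, 

$$\hat{\sigma}^2(\Delta) = [\hat{\sigma}^2(\beta) + \hat{\sigma}^2(\pi\beta)]/n \qquad \text{(B10)}$$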



and replacement of Equations B8 and B9 in B10 gives (after simplifying 
terms) Equation 6.2 in Table 6.3. 
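
The algebra behind Equations B7 through B10 can be checked numerically. The sketch below is illustrative only: it assumes a matrix of dichotomous (0/1) scores with k persons as rows and n items as columns, it assumes that the $s^2$ terms are variances computed with the number of observations (not one less) as divisor, and all function names are hypothetical. The summary-statistic expression used for $\hat{\sigma}^2(\pi)$ is an algebraic simplification of Equation B7 derived for this sketch, not a quotation of Table 6.3.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pvar(values):
    """Variance with divisor len(values) (assumed for the s^2 terms)."""
    values = list(values)
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def components_from_summary(scores):
    """Variance components from summary statistics (form of B8, B9, B10)."""
    k, n = len(scores), len(scores[0])
    s2_p = pvar(mean(row) for row in scores)        # variance of person means
    s2_i = pvar(mean(col) for col in zip(*scores))  # variance of item means
    xbar = mean(mean(row) for row in scores)
    pq = xbar * (1 - xbar)
    denom = (n - 1) * (k - 1)
    var_pi = k * (n * s2_p + s2_i - pq) / denom      # simplification of B7
    var_beta = n * (k * s2_i + s2_p - pq) / denom    # B8
    var_pibeta = n * k * (pq - s2_p - s2_i) / denom  # B9
    var_delta = (var_beta + var_pibeta) / n          # B10
    return {"pi": var_pi, "beta": var_beta, "pibeta": var_pibeta, "delta": var_delta}


def components_from_anova(scores):
    """Same components from persons-by-items ANOVA mean squares."""
    k, n = len(scores), len(scores[0])
    xbar = mean(mean(row) for row in scores)
    ss_p = n * sum((mean(row) - xbar) ** 2 for row in scores)
    ss_i = k * sum((mean(col) - xbar) ** 2 for col in zip(*scores))
    ss_total = sum((x - xbar) ** 2 for row in scores for x in row)
    ms_p = ss_p / (k - 1)
    ms_i = ss_i / (n - 1)
    ms_pi = (ss_total - ss_p - ss_i) / ((k - 1) * (n - 1))
    var_beta = (ms_i - ms_pi) / k
    return {
        "pi": (ms_p - ms_pi) / n,
        "beta": var_beta,
        "pibeta": ms_pi,
        "delta": (var_beta + ms_pi) / n,
    }


if __name__ == "__main__":
    data = [[1, 1, 1, 0], [1, 0, 1, 0], [1, 1, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0]]
    print(components_from_summary(data))
    print(components_from_anova(data))
```

Under these assumptions the two routes agree, and the value for $\hat{\sigma}^2(\Delta)$ reduces to $[\bar{x}(1-\bar{x}) - s^2(\bar{x}_p)]/(n-1)$, which is the numerator appearing in Equation B12.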

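Brennan and Kane (1977a) report that a consistent estimate of 
$\Phi(c_o)$ is: 

$$\hat{\Phi}(c_o) = 1 - \frac{1}{n-1}\left\{\frac{\bar{x}(1-\bar{x}) - s^2(\bar{x}_p)}{(\bar{x}-c_o)^2 + s^2(\bar{x}_p)}\right\} \qquad \text{(B11)}$$

$$= 1 - \left\{\frac{[\bar{x}(1-\bar{x}) - s^2(\bar{x}_p)]/(n-1)}{(\bar{x}-c_o)^2 + s^2(\bar{x}_p)}\right\} \qquad \text{(B12)}$$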



The numerator of the term in braces is simply $\sigma^2(\Delta)$ given by Equation 6.2 
in Table 6.3; consequently, Equation B11 can be expressed as Equation 6.3 
in Table 6.3. [Technically, $\sigma^2(\Delta)$ in Equation 6.3 should be $\hat{\sigma}^2(\Delta)$; but, 
as previously stated, notational distinctions between parameters and 
estimates are not made in the body of this handbook.] Equation 6.4 follows 
from the fact that $\Phi(c_o)$ equals KR-21 if $c_o = \bar{x}$ (see Brennan, 1977). The 
expression for KR-21 in Equation 6.3 may appear strange because it 
involves $\sigma^2(\Delta)$, but it is easily verified that this expression is 
algebraically identical to the well-known expression for KR-21. 
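
The claimed identity with KR-21 can be verified with a short sketch. It assumes that examinee scores are expressed as proportions of n items correct, that $s^2(\bar{x}_p)$ is computed with the number of examinees as divisor, and that the same variance convention is used in the familiar KR-21 formula; the function names are hypothetical.

```python
def phi_hat(person_means, n, c_o):
    """Estimate of the index in Equation B12 for cutting score c_o."""
    k = len(person_means)
    xbar = sum(person_means) / k
    s2_p = sum((x - xbar) ** 2 for x in person_means) / k
    var_delta = (xbar * (1 - xbar) - s2_p) / (n - 1)  # numerator in braces
    return 1 - var_delta / ((xbar - c_o) ** 2 + s2_p)


def kr21(person_means, n):
    """Familiar KR-21 on the number-correct scale, same variance divisor."""
    k = len(person_means)
    totals = [n * x for x in person_means]
    m = sum(totals) / k
    s2 = sum((t - m) ** 2 for t in totals) / k
    return (n / (n - 1)) * (1 - m * (n - m) / (n * s2))


if __name__ == "__main__":
    means = [0.75, 0.50, 1.00, 0.25, 0.50]  # five examinees, four items
    xbar = sum(means) / len(means)
    print(phi_hat(means, 4, xbar), kr21(means, 4))  # the two values agree
    print(phi_hat(means, 4, 0.80))                  # cutting score away from the mean
```

Because $(\bar{x} - c_o)^2$ enters only the denominator, the estimate is smallest when the cutting score equals the mean and grows as the cutting score moves away from it.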

The steps provided in Table 6.5 for obtaining estimates of threshold 
loss coefficients of agreement are based on Huynh's (1976) normal 
approximation procedure (see, also, Subkoviak, 1980), without using 
an arcsine transformation (see Peng & Subkoviak, in press). In Table 6.5 
reference is made to using the "closest" value in Table A.3; alternatively, 
one can obtain better estimates using linear interpolation (see Huynh, 
1978; different context, but same process). Huynh (1978) provides a 
computer program for estimating threshold loss coefficients, as well 
as tables of estimates of $p_o$, kappa, and their standard errors for 
test lengths of 5 to 10 items (see, also, Huynh & Saunders, 1980). 

Since the procedure outlined in Table 6.4 is based on a normal 
approximation, estimates obtained using this procedure may be somewhat biased. 
However, the degree of bias is likely to be small unless n is quite 
small and/or c is quite close to one. 



In Table 6.5, Equation 6.6 is simply $[\sigma^2(\beta) + \sigma^2(\pi\beta)]/n'$; and the 
remaining equations and steps constitute a somewhat ad hoc approach for 
using Huynh's normal approximation procedure to estimate the proportion 
of inconsistent decisions for a test of length $n'$. 

Brennan and Kane (1977b) show that $\sigma^2(\Delta)$ is algebraically equal to 
the average of the squared values of $\sigma(\Delta_p)$ in Table 5.4. Note also that 
$\sigma(\Delta_p)$ is identical to Lord's (1957) formula for the standard error of 
measurement of an examinee's mean score. 
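
A brief numerical illustration of this relationship follows. It assumes that $\sigma(\Delta_p)$ takes the binomial form $\sqrt{\bar{x}_p(1-\bar{x}_p)/(n-1)}$ for an examinee with proportion-correct score $\bar{x}_p$ on n items, which is the usual statement of Lord's formula on the mean-score scale; that reading of Table 5.4, the divisor used for $s^2(\bar{x}_p)$, and the function names are assumptions of this sketch.

```python
from math import sqrt


def lord_sem(x_p, n):
    """Assumed form of sigma(Delta_p) for a proportion-correct score x_p."""
    return sqrt(x_p * (1 - x_p) / (n - 1))


def group_error_variance(person_means, n):
    """Group-level error variance: [xbar(1 - xbar) - s^2(x_p)] / (n - 1)."""
    k = len(person_means)
    xbar = sum(person_means) / k
    s2_p = sum((x - xbar) ** 2 for x in person_means) / k
    return (xbar * (1 - xbar) - s2_p) / (n - 1)


def average_squared_sem(person_means, n):
    """Average over examinees of the squared individual standard errors."""
    return sum(lord_sem(x, n) ** 2 for x in person_means) / len(person_means)


if __name__ == "__main__":
    means = [0.75, 0.50, 1.00, 0.25, 0.50]  # five examinees, four items
    print(group_error_variance(means, 4))
    print(average_squared_sem(means, 4))
```

Under these assumptions the two printed values coincide, which is the sense in which the group-level quantity is the average of the examinee-level ones.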